| Attribute | Value |
|---|---|
| name | model-serving |
| version | 2.0.0 |
| sasmp_version | 1.3.0 |
| description | Master model serving - inference optimization, scaling, deployment, edge serving |
| bonded_agent | 05-model-serving |
| bond_type | PRIMARY_BOND |
| category | deployment |
| difficulty | intermediate_to_advanced |
| estimated_hours | 35 |
| prerequisites | mlops-basics, training-pipelines |
Model Serving Skill
Learn to deploy ML models for production inference, covering optimization, scaling, and edge serving.
Skill Overview
| Attribute | Value |
|---|---|
| Bonded Agent | 05-model-serving |
| Difficulty | Intermediate to Advanced |
| Duration | 35 hours |
| Prerequisites | mlops-basics, training-pipelines |
Learning Objectives
- Deploy models with BentoML and Triton
- Optimize inference with quantization and ONNX
- Configure auto-scaling policies
- Implement batch and streaming inference
- Deploy to edge devices
Topics Covered
Module 1: Serving Platforms (8 hours)
Platform Comparison:
| Platform | Multi-framework | Dynamic Batching | Kubernetes |
|---|---|---|---|
| TorchServe | PyTorch only | ✅ | ✅ |
| Triton | ✅ | ✅ | ✅ |
| BentoML | ✅ | ✅ | ✅ |
| Seldon | ✅ | ⚠️ | ✅ |
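To make the comparison concrete, here is a minimal client-side sketch of calling a model hosted on Triton over HTTP with the tritonclient package. The model name resnet50 and the tensor names input__0/output__0 are placeholders; they must match whatever your Triton model repository defines.

    import numpy as np
    import tritonclient.http as httpclient

    # Placeholder model and tensor names; adjust to your model repository.
    client = httpclient.InferenceServerClient(url="localhost:8000")

    infer_input = httpclient.InferInput("input__0", [1, 3, 224, 224], "FP32")
    infer_input.set_data_from_numpy(np.random.rand(1, 3, 224, 224).astype(np.float32))
    requested_output = httpclient.InferRequestedOutput("output__0")

    response = client.infer(model_name="resnet50", inputs=[infer_input], outputs=[requested_output])
    predictions = response.as_numpy("output__0")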
Module 2: BentoML Deployment (10 hours)
Service Definition:

    import bentoml
    import numpy as np
    import torch

    @bentoml.service(resources={"gpu": 1, "memory": "4Gi"})
    class ModelService:
        def __init__(self):
            # Load the latest model version from the BentoML model store
            self.model = bentoml.pytorch.load_model("model:latest")

        @bentoml.api(route="/predict")
        async def predict(self, input_array: np.ndarray) -> dict:
            with torch.no_grad():
                predictions = self.model(input_array)
            return {"predictions": predictions.tolist()}

Exercises:
- Create BentoML service for your model
- Containerize and deploy to Kubernetes
- Configure traffic management
Module 3: Inference Optimization (10 hours)
Optimization Techniques:

    import torch

    # 1. Dynamic quantization: convert Linear layers to int8 weights
    quantized_model = torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )

    # 2. ONNX export (sample_input is a representative example input)
    torch.onnx.export(model, sample_input, "model.onnx")

    # 3. TensorRT conversion
    import tensorrt as trt
    # Convert the exported ONNX model to a TensorRT engine for NVIDIA GPUs

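Once exported, the ONNX model can be run with ONNX Runtime. A minimal sketch, assuming the model takes a single float32 input; the (1, 3, 224, 224) shape is a placeholder for your model's actual input signature.

    import numpy as np
    import onnxruntime as ort

    # Placeholder input shape; must match the exported model's signature.
    session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
    input_name = session.get_inputs()[0].name
    dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)
    outputs = session.run(None, {input_name: dummy})
    print(outputs[0].shape)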
Typical speedups (actual gains depend on hardware and model):
| Technique | Speedup | Accuracy Impact |
|---|---|---|
| FP16 | 2-3x | <1% |
| INT8 | 3-4x | 1-2% |
| TensorRT | 5-10x | <1% |
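The FP16 row is not covered by the snippet above. A minimal sketch using a toy model (in practice, model is your trained network), assuming a CUDA-capable GPU:

    import torch

    # Toy model for illustration only
    model = torch.nn.Sequential(torch.nn.Linear(16, 8), torch.nn.ReLU(), torch.nn.Linear(8, 4))
    sample_input = torch.randn(1, 16)

    if torch.cuda.is_available():  # FP16 speedups require GPU support
        model_fp16 = model.half().cuda().eval()
        with torch.no_grad():
            outputs = model_fp16(sample_input.half().cuda())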
Module 4: Scaling & Monitoring (7 hours)
Kubernetes HPA:

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: model-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: model-serving
      minReplicas: 2
      maxReplicas: 10
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 70

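For the monitoring half of this module, here is a minimal sketch of exposing inference metrics with the prometheus_client library; the metric names and port are assumptions, and a Prometheus server must be configured to scrape the endpoint.

    import time
    from prometheus_client import Counter, Histogram, start_http_server

    REQUESTS = Counter("inference_requests_total", "Total inference requests")
    LATENCY = Histogram("inference_latency_seconds", "Inference latency in seconds")

    def instrumented_predict(predict_fn, inputs):
        # Wrap any predict function to record request count and latency
        REQUESTS.inc()
        start = time.perf_counter()
        try:
            return predict_fn(inputs)
        finally:
            LATENCY.observe(time.perf_counter() - start)

    # Expose metrics at http://localhost:9090/metrics (port is an assumption)
    start_http_server(9090)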
Code Templates
Template: Production Serving

    # templates/serving.py
    from fastapi import FastAPI
    import numpy as np
    import torch

    app = FastAPI()

    class ProductionServer:
        def __init__(self, model_path: str):
            # Load a TorchScript model and switch to inference mode
            self.model = torch.jit.load(model_path)
            self.model.eval()

        def predict(self, inputs: np.ndarray) -> np.ndarray:
            with torch.no_grad():
                # Cast to float32: np.array of Python floats defaults to float64
                tensor = torch.from_numpy(inputs).float()
                outputs = self.model(tensor)
            return outputs.numpy()

    server = ProductionServer("model.pt")

    @app.post("/predict")
    async def predict(data: dict):
        inputs = np.array(data["inputs"])
        predictions = server.predict(inputs)
        return {"predictions": predictions.tolist()}

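A quick smoke test for the template, assuming it is saved as templates/serving.py and started with `uvicorn templates.serving:app --port 8000`; the payload shape is a placeholder for whatever features the model expects.

    import requests

    # Placeholder payload; shape must match what the TorchScript model expects.
    resp = requests.post(
        "http://localhost:8000/predict",
        json={"inputs": [[0.1, 0.2, 0.3]]},
    )
    resp.raise_for_status()
    print(resp.json()["predictions"])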
Troubleshooting Guide
| Issue | Cause | Solution |
|---|---|---|
| High latency | No inference optimization | Apply quantization and request batching (see the sketch below) |
| Cold starts | Serverless scale-to-zero | Pre-warm instances, set a minimum replica count |
| OOM errors | Model too large for available memory | Quantize the model, reduce the batch size |
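The batching fix above refers to micro-batching: buffering individual requests briefly and scoring them as one batch. Platforms such as Triton and BentoML provide this out of the box; the asyncio sketch below only illustrates the idea, with the 16-request batch size and 5 ms wait chosen purely for illustration.

    import asyncio
    import numpy as np

    class MicroBatcher:
        """Buffer individual requests and run them through the model as one batch."""

        def __init__(self, predict_fn, max_batch_size=16, max_wait_ms=5):
            self.predict_fn = predict_fn
            self.max_batch_size = max_batch_size
            self.max_wait = max_wait_ms / 1000
            self.queue = asyncio.Queue()

        async def submit(self, x: np.ndarray) -> np.ndarray:
            # Called per request; resolves once the batch containing x is scored
            fut = asyncio.get_running_loop().create_future()
            await self.queue.put((x, fut))
            return await fut

        async def run(self):
            # Start once at app startup, e.g. asyncio.create_task(batcher.run())
            while True:
                items = [await self.queue.get()]
                loop = asyncio.get_running_loop()
                deadline = loop.time() + self.max_wait
                while len(items) < self.max_batch_size:
                    timeout = deadline - loop.time()
                    if timeout <= 0:
                        break
                    try:
                        items.append(await asyncio.wait_for(self.queue.get(), timeout))
                    except asyncio.TimeoutError:
                        break
                xs, futs = zip(*items)
                preds = self.predict_fn(np.stack(xs))  # one batched model call
                for fut, pred in zip(futs, preds):
                    fut.set_result(pred)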
Resources
- BentoML Documentation
- Triton Inference Server
- ONNX Runtime
- [See: ml-monitoring] - Monitor deployed models
Version History
| Version | Date | Changes |
|---|---|---|
| 2.0.0 | 2024-12 | Production-grade with optimization |
| 1.0.0 | 2024-11 | Initial release |