| name | ml-deployment-helper |
| description | Prepares ML models for production deployment with containerization, API creation, monitoring setup, and A/B testing. Activates for "deploy model", "production deployment", "model API", "containerize model", "docker ml", "serving ml model", "model monitoring", "A/B test model". Generates deployment artifacts and ensures models are production-ready with monitoring, versioning, and rollback capabilities. |
ML Deployment Helper
Overview
Bridges the gap between trained models and production systems. Generates deployment artifacts, APIs, monitoring, and A/B testing infrastructure following MLOps best practices.
Deployment Checklist
Before deploying any model, this skill ensures:
- ✅ Model versioned and tracked
- ✅ Dependencies documented (requirements.txt/Dockerfile)
- ✅ API endpoint created
- ✅ Input validation implemented
- ✅ Monitoring configured
- ✅ A/B testing ready
- ✅ Rollback plan documented
- ✅ Performance benchmarked
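Most of these checks can be scripted before a deploy. A minimal verification sketch (the file paths and health-check URL below are illustrative assumptions, not artifacts produced by the skill):

# Hypothetical pre-deployment check; paths and the health URL are assumptions.
from pathlib import Path
import urllib.request

def verify_deployment_ready(model_path="models/model-v3.pkl",
                            health_url="http://localhost:8000/health"):
    """Fail fast if basic deployment prerequisites are missing."""
    checks = {
        "model artifact exists": Path(model_path).exists(),
        "requirements.txt present": Path("requirements.txt").exists(),
        "Dockerfile present": Path("Dockerfile").exists(),
    }
    try:
        with urllib.request.urlopen(health_url, timeout=3) as resp:
            checks["health endpoint responds"] = resp.status == 200
    except OSError:
        checks["health endpoint responds"] = False
    for name, ok in checks.items():
        print(f"{'PASS' if ok else 'FAIL'}: {name}")
    return all(checks.values())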
Deployment Patterns
Pattern 1: REST API (FastAPI)
from specweave import create_model_api
# Generates production-ready API
api = create_model_api(
    model_path="models/model-v3.pkl",
    increment="0042",
    framework="fastapi"
)
# Creates:
# - api/
# ├── main.py (FastAPI app)
# ├── models.py (Pydantic schemas)
# ├── predict.py (Prediction logic)
# ├── Dockerfile
# ├── requirements.txt
# └── tests/
Generated main.py:
from datetime import datetime

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import joblib

app = FastAPI(title="Recommendation Model API", version="0042-v3")
model = joblib.load("model-v3.pkl")

class PredictionRequest(BaseModel):
    user_id: int
    context: dict

@app.post("/predict")
async def predict(request: PredictionRequest):
    try:
        prediction = model.predict([request.dict()])
        return {
            "recommendations": prediction.tolist(),
            "model_version": "0042-v3",
            "timestamp": datetime.now()
        }
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/health")
async def health():
    return {"status": "healthy", "model_loaded": model is not None}
Pattern 2: Batch Prediction
from specweave import create_batch_predictor
# For offline scoring
batch_predictor = create_batch_predictor(
    model_path="models/model-v3.pkl",
    increment="0042",
    input_path="s3://bucket/data/",
    output_path="s3://bucket/predictions/"
)
# Creates:
# - batch/
# ├── predictor.py
# ├── scheduler.yaml (Airflow/Kubernetes CronJob)
# └── monitoring.py
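The generated predictor.py is not shown here, but its core loop typically amounts to load, score in chunks, write. A rough sketch, assuming Parquet input and a scikit-learn-style model (column layout and paths are assumptions):

# Illustrative batch scoring loop; generated code may differ.
import joblib
import pandas as pd

def run_batch(model_path, input_path, output_path, batch_size=10_000):
    model = joblib.load(model_path)
    df = pd.read_parquet(input_path)  # e.g. s3://bucket/data/ via s3fs
    predictions = []
    for start in range(0, len(df), batch_size):
        chunk = df.iloc[start:start + batch_size]
        predictions.extend(model.predict(chunk))
    df["prediction"] = predictions
    df.to_parquet(output_path, index=False)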
Pattern 3: Real-Time Streaming
from specweave import create_streaming_predictor
# For Kafka/Kinesis streams
streaming = create_streaming_predictor(
    model_path="models/model-v3.pkl",
    increment="0042",
    input_topic="user-events",
    output_topic="predictions"
)
# Creates:
# - streaming/
# ├── consumer.py
# ├── predictor.py
# ├── producer.py
# └── docker-compose.yaml
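A stripped-down version of the consumer/predictor/producer trio, sketched with kafka-python (broker address, topics, and message layout are assumptions, not the generated files):

# Illustrative streaming prediction loop.
import json
import joblib
from kafka import KafkaConsumer, KafkaProducer

model = joblib.load("models/model-v3.pkl")
consumer = KafkaConsumer(
    "user-events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for message in consumer:
    event = message.value
    prediction = model.predict([event["features"]])  # assumed message shape
    producer.send("predictions", {"user_id": event.get("user_id"),
                                  "prediction": prediction.tolist()})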
Containerization
from specweave import containerize_model
# Generates optimized Dockerfile
dockerfile = containerize_model(
    model_path="models/model-v3.pkl",
    framework="sklearn",
    python_version="3.10",
    increment="0042"
)
Generated Dockerfile:
FROM python:3.10-slim
WORKDIR /app
# Copy model and dependencies
COPY models/model-v3.pkl /app/model-v3.pkl
COPY requirements.txt /app/
# Install curl for the health check (not included in the slim base image) and dependencies
RUN apt-get update && apt-get install -y --no-install-recommends curl && rm -rf /var/lib/apt/lists/*
RUN pip install --no-cache-dir -r requirements.txt
# Copy application
COPY api/ /app/api/
# Health check
HEALTHCHECK --interval=30s --timeout=3s \
    CMD curl -f http://localhost:8000/health || exit 1
# Run API
CMD ["uvicorn", "api.main:app", "--host", "0.0.0.0", "--port", "8000"]
Monitoring Setup
from specweave import setup_model_monitoring
# Configures monitoring for production
monitoring = setup_model_monitoring(
    model_name="recommendation-model",
    increment="0042",
    metrics=[
        "prediction_latency",
        "throughput",
        "error_rate",
        "prediction_distribution",
        "feature_drift"
    ]
)
# Creates:
# - monitoring/
# ├── prometheus.yaml
# ├── grafana-dashboard.json
# ├── alerts.yaml
# └── drift-detector.py
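The drift detector boils down to comparing live feature distributions against a training-time reference. A minimal sketch using a two-sample Kolmogorov-Smirnov test (the 0.05 threshold is an assumption; the generated drift-detector.py may use a different test):

# Minimal feature-drift check on a single numeric feature.
from scipy.stats import ks_2samp

def detect_drift(reference_values, live_values, p_threshold=0.05):
    """Return (drifted, details) comparing live data to the training reference."""
    statistic, p_value = ks_2samp(reference_values, live_values)
    return p_value < p_threshold, {"ks_statistic": statistic, "p_value": p_value}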
A/B Testing Infrastructure
from specweave import create_ab_test
# Sets up A/B test framework
ab_test = create_ab_test(
    control_model="model-v2.pkl",
    treatment_model="model-v3.pkl",
    traffic_split=0.1,  # 10% to new model
    success_metric="click_through_rate",
    increment="0042"
)
# Creates:
# - ab-test/
# ├── router.py (traffic splitting)
# ├── metrics.py (success tracking)
# ├── statistical-tests.py (significance testing)
# └── dashboard.py (real-time monitoring)
A/B Test Router:
import hashlib

def route_prediction(user_id, features, control_model, treatment_model):
    """Route to control or treatment based on a hash of user_id."""
    # Consistent hashing: the same user always gets the same model across processes
    user_bucket = int(hashlib.sha256(str(user_id).encode()).hexdigest(), 16) % 100
    if user_bucket < 10:  # 10% to treatment
        return treatment_model.predict(features), "treatment"
    return control_model.predict(features), "control"
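statistical-tests.py decides whether the treatment actually moves the success metric. For a rate metric like click-through rate, a two-proportion z-test is a reasonable sketch (this is illustrative, not the generated file):

# Two-proportion z-test for treatment vs. control CTR.
from math import sqrt
from scipy.stats import norm

def ctr_significance(clicks_control, views_control, clicks_treatment, views_treatment):
    """Return the two-sided p-value for the difference in click-through rate."""
    p_c = clicks_control / views_control
    p_t = clicks_treatment / views_treatment
    p_pool = (clicks_control + clicks_treatment) / (views_control + views_treatment)
    se = sqrt(p_pool * (1 - p_pool) * (1 / views_control + 1 / views_treatment))
    z = (p_t - p_c) / se
    return 2 * norm.sf(abs(z))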
Model Versioning
from specweave import ModelVersion
# Register model version
version = ModelVersion.register(
    model_path="models/model-v3.pkl",
    increment="0042",
    metadata={
        "accuracy": 0.87,
        "training_date": "2024-01-15",
        "data_version": "v2024-01",
        "framework": "xgboost==1.7.0"
    }
)

# Easy rollback
if production_metrics["error_rate"] > threshold:
    ModelVersion.rollback(to_version="0042-v2")
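Behind an interface like this usually sits nothing more than a metadata store mapping versions to artifacts. A toy file-based sketch (not the specweave implementation; the JSON layout and file location are assumptions):

# Toy version registry illustrating register/rollback lookups.
import json
from pathlib import Path

REGISTRY = Path("models/registry.json")

def register_version(version, model_path, metadata):
    registry = json.loads(REGISTRY.read_text()) if REGISTRY.exists() else {}
    registry[version] = {"model_path": model_path, **metadata}
    REGISTRY.write_text(json.dumps(registry, indent=2))

def get_model_path(version):
    """Look up the artifact for a version, e.g. when rolling back."""
    return json.loads(REGISTRY.read_text())[version]["model_path"]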
Load Testing
from specweave import load_test_model
# Benchmark model performance
results = load_test_model(
    api_url="http://localhost:8000/predict",
    requests_per_second=[10, 50, 100, 500, 1000],
    duration_seconds=60,
    increment="0042"
)
Output:
Load Test Results:
==================
| RPS | Latency P50 | Latency P95 | Latency P99 | Error Rate |
|------|-------------|-------------|-------------|------------|
| 10 | 35ms | 45ms | 50ms | 0.00% |
| 50 | 38ms | 52ms | 65ms | 0.00% |
| 100 | 45ms | 70ms | 95ms | 0.02% |
| 500 | 120ms | 250ms | 400ms | 1.20% |
| 1000 | 350ms | 800ms | 1200ms | 8.50% |
Recommendation: Deploy with max 100 RPS per instance
Target: <100ms P95 latency (achieved at 100 RPS)
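Outside the skill, a quick latency spot-check can be scripted directly against the API (the endpoint, payload, and request counts below are assumptions):

# Rough latency benchmark; not a substitute for the full load test.
import statistics
import time
from concurrent.futures import ThreadPoolExecutor
import requests

def measure_latencies(url, payload, n_requests=200, concurrency=10):
    def one_call(_):
        start = time.perf_counter()
        requests.post(url, json=payload, timeout=10)
        return (time.perf_counter() - start) * 1000  # milliseconds
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(one_call, range(n_requests)))
    cuts = statistics.quantiles(latencies, n=100)
    return {"p50_ms": cuts[49], "p95_ms": cuts[94], "p99_ms": cuts[98]}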
Deployment Commands
# Generate deployment artifacts
/ml:deploy-prepare 0042
# Create API
/ml:create-api --increment 0042 --framework fastapi
# Setup monitoring
/ml:setup-monitoring 0042
# Create A/B test
/ml:create-ab-test --control v2 --treatment v3 --split 0.1
# Load test
/ml:load-test 0042 --rps 100 --duration 60s
# Deploy to production
/ml:deploy 0042 --environment production
Deployment Increment
The skill creates a deployment increment:
.specweave/increments/0043-deploy-recommendation-model/
├── spec.md (deployment requirements)
├── plan.md (deployment strategy)
├── tasks.md
│ ├── [ ] Containerize model
│ ├── [ ] Create API
│ ├── [ ] Setup monitoring
│ ├── [ ] Configure A/B test
│ ├── [ ] Load test
│ ├── [ ] Deploy to staging
│ ├── [ ] Validate staging
│ └── [ ] Deploy to production
├── api/ (FastAPI app)
├── monitoring/ (Grafana dashboards)
├── ab-test/ (A/B testing logic)
└── load-tests/ (Performance benchmarks)
Best Practices
- Always load test before production
- Start with 1-5% traffic in A/B test
- Monitor model drift in production
- Version everything (model, data, code)
- Document rollback plan before deploying
- Set up alerts for anomalies
- Gradual rollout (canary deployment)
Integration with SpecWeave
# After training model (increment 0042)
/specweave:inc "0043-deploy-recommendation-model"
# Generates deployment increment with all artifacts
/specweave:do
# Deploy to production when ready
/ml:deploy 0043 --environment production
Model deployment is not the end—it's the beginning of the MLOps lifecycle.