ML System Design
This skill provides frameworks for designing production machine learning systems, from data pipelines to model serving.
When to Use This Skill
Keywords: ML pipeline, machine learning system, feature store, model training, model serving, ML infrastructure, MLOps, A/B testing ML, feature engineering, model deployment
Use this skill when:
- Designing end-to-end ML systems for production
- Planning feature store architecture
- Designing model training pipelines
- Planning model serving infrastructure
- Preparing for ML system design interviews
- Evaluating ML platform tools and frameworks
ML System Architecture Overview
The ML System Lifecycle
┌─────────────────────────────────────────────────────────────────────────┐
│ ML SYSTEM LIFECYCLE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌────────┐ │
│ │ Data │──▶│ Feature │──▶│ Model │──▶│ Model │──▶│ Monitor│ │
│ │ Ingestion│ │ Pipeline │ │ Training │ │ Serving │ │ & Eval │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ └────────┘ │
│ │ │ │ │ │ │
│ ▼ ▼ ▼ ▼ ▼ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌────────┐ │
│ │ Data │ │ Feature │ │ Model │ │ Inference│ │ Metrics│ │
│ │ Lake │ │ Store │ │ Registry │ │ Cache │ │ Store │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ └────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
Key Components
| Component | Purpose | Examples |
|-----------|---------|----------|
| Data Ingestion | Collect raw data from sources | Kafka, Kinesis, Pub/Sub |
| Feature Pipeline | Transform raw data to features | Spark, Flink, dbt |
| Feature Store | Store and serve features | Feast, Tecton, Vertex AI |
| Model Training | Train and validate models | SageMaker, Vertex AI, Kubeflow |
| Model Registry | Version and track models | MLflow, Weights & Biases |
| Model Serving | Serve predictions | TensorFlow Serving, Triton, vLLM |
| Monitoring | Track model performance | Evidently, WhyLabs, Arize |
Feature Store Architecture
Why Feature Stores?
Problems without a feature store:
- Training-serving skew (the same feature computed differently offline and online)
- Duplicate feature computation across teams
- No feature versioning or lineage
- Slow feature experimentation
Feature Store Components
┌─────────────────────────────────────────────────────────────────┐
│ FEATURE STORE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────┐ ┌─────────────────────┐ │
│ │ OFFLINE STORE │ │ ONLINE STORE │ │
│ │ │ │ │ │
│ │ - Historical data │ │ - Low-latency │ │
│ │ - Training queries │ ────▶ │ - Point lookups │ │
│ │ - Batch features │ sync │ - Real-time serving│ │
│ │ │ │ │ │
│ │ (Data Warehouse) │ │ (Redis, DynamoDB) │ │
│ └─────────────────────┘ └─────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────┐│
│ │ FEATURE REGISTRY ││
│ │ - Feature definitions - Version control ││
│ │ - Data lineage - Access control ││
│ └─────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────┘
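A minimal sketch of the offline-to-online sync path shown above, assuming batch features already materialized to Parquet and Redis as the online store; the file path, key scheme, and feature names are illustrative:
```python
# Sketch: push batch-computed features from the offline store to a Redis
# online store, keyed by entity ID. Paths, key names, and feature columns
# are illustrative assumptions.
import pandas as pd
import redis

# Offline store: batch features materialized by the feature pipeline
features = pd.read_parquet("warehouse/user_features_daily.parquet")

r = redis.Redis(host="localhost", port=6379)

# Online store: one hash per user, holding the latest feature values
for row in features.itertuples(index=False):
    r.hset(
        f"features:user:{row.user_id}",
        mapping={
            "purchase_count_30d": int(row.purchase_count_30d),
            "avg_order_value": float(row.avg_order_value),
        },
    )

# At serving time, a point lookup returns the same feature values
online_features = r.hgetall("features:user:12345")
```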
Feature Types
| Type | Computation | Storage | Example |
|------|-------------|---------|---------|
| Batch | Scheduled (hourly/daily) | Offline → Online | User purchase count (30 days) |
| Streaming | Real-time event processing | Direct to online | Items in cart (current) |
| On-demand | Request-time computation | Not stored | Distance to nearest store |
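
On-demand features are computed from request context at serving time rather than looked up. A small sketch of the "distance to nearest store" example; the store coordinates are illustrative:
```python
# Sketch: an on-demand feature computed at request time (not stored).
# Store locations are illustrative.
from math import radians, sin, cos, asin, sqrt

STORES = [(37.7749, -122.4194), (34.0522, -118.2437)]  # (lat, lon)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

def distance_to_nearest_store_km(user_lat: float, user_lon: float) -> float:
    return min(haversine_km(user_lat, user_lon, lat, lon) for lat, lon in STORES)
```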
Training-Serving Consistency
TRAINING (Historical):
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Historical │───▶│ Point-in-Time│───▶│ Training │
│ Events │ │ Join │ │ Dataset │
└──────────────┘ └──────────────┘ └──────────────┘
│
Uses feature
definitions
│
SERVING (Real-time): ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Online │───▶│ Same Feature │───▶│ Prediction │
│ Store │ │ Definitions │ │ Request │
└──────────────┘ └──────────────┘ └──────────────┘
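A minimal sketch of the point-in-time join with pandas: each label row is matched to the most recent feature value known at or before the label's timestamp, which prevents leaking future information into training. Column names and values are illustrative:
```python
# Sketch: point-in-time join to build a training set without feature leakage.
# Column names and values are illustrative.
import pandas as pd

labels = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event_timestamp": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-01-10"]),
    "label": [0, 1, 0],
})

feature_history = pd.DataFrame({
    "user_id": [1, 1, 2],
    "feature_timestamp": pd.to_datetime(["2024-01-01", "2024-01-15", "2024-01-01"]),
    "purchase_count_30d": [3, 7, 1],
})

# merge_asof requires both frames to be sorted by the time key
training_set = pd.merge_asof(
    labels.sort_values("event_timestamp"),
    feature_history.sort_values("feature_timestamp"),
    left_on="event_timestamp",
    right_on="feature_timestamp",
    by="user_id",
    direction="backward",  # only use feature values known at or before the event
)
```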
Model Training Infrastructure
Training Pipeline Components
┌───────────────────────────────────────────────────────────────────────┐
│ TRAINING PIPELINE │
├───────────────────────────────────────────────────────────────────────┤
│ │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ Data │──▶│ Feature │──▶│ Model │──▶│ Model │ │
│ │ Loader │ │ Transform│ │ Train │ │ Validate │ │
│ └────────────┘ └────────────┘ └────────────┘ └────────────┘ │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ Experiment │ │ Hyperparameter│ │ Checkpoint │ │ Model │ │
│ │ Tracking │ │ Tuning │ │ Storage │ │ Registry │ │
│ └────────────┘ └────────────┘ └────────────┘ └────────────┘ │
│ │
└───────────────────────────────────────────────────────────────────────┘
Training Infrastructure Patterns
| Pattern | Use Case | Tools |
|---------|----------|-------|
| Single-node | Small datasets, quick experiments | Jupyter, local GPU |
| Distributed data-parallel | Large datasets, same model | Horovod, PyTorch DDP |
| Model-parallel | Large models that don't fit in memory | DeepSpeed, FSDP, Megatron |
| Hyperparameter tuning | Automated model optimization | Optuna, Ray Tune |
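
A condensed sketch of the distributed data-parallel pattern with PyTorch DDP, assuming a single node launched via `torchrun` so that one process drives each GPU; the toy model and random data are illustrative:
```python
# Sketch: distributed data-parallel training with PyTorch DDP.
# Assumes launch via `torchrun --nproc_per_node=8 train.py`.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

dist.init_process_group(backend="nccl")            # one process per GPU
local_rank = int(os.environ["LOCAL_RANK"])         # set by torchrun
torch.cuda.set_device(local_rank)

dataset = TensorDataset(torch.randn(1024, 16), torch.randn(1024, 1))
sampler = DistributedSampler(dataset)              # shards data across ranks
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

model = DDP(torch.nn.Linear(16, 1).cuda(), device_ids=[local_rank])
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(3):
    sampler.set_epoch(epoch)                       # reshuffle shards each epoch
    for x, y in loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x.cuda()), y.cuda())
        loss.backward()                            # gradients all-reduced across ranks
        optimizer.step()

dist.destroy_process_group()
```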
Experiment Tracking
Track for reproducibility:
| What to Track | Why |
|---------------|-----|
| Hyperparameters | Reproduce training runs |
| Metrics | Compare model performance |
| Artifacts | Model files, datasets |
| Code version | Git commit hash |
| Environment | Docker image, dependencies |
| Data version | Dataset hash or snapshot |
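
A minimal sketch of logging these items with MLflow; the tracking URI, experiment name, parameter values, and file paths are illustrative:
```python
# Sketch: experiment tracking with MLflow. URI, params, and paths are illustrative.
import mlflow

mlflow.set_tracking_uri("http://mlflow.internal:5000")
mlflow.set_experiment("ctr-model")

with mlflow.start_run():
    # Hyperparameters plus code/data versions for reproducibility
    mlflow.log_params({"learning_rate": 3e-4, "batch_size": 256, "epochs": 10})
    mlflow.set_tags({"git_commit": "abc1234", "data_version": "snapshot-2024-01-15"})

    # Metrics, optionally per step or epoch
    mlflow.log_metric("val_auc", 0.87, step=10)

    # Artifacts: model files, evaluation reports, etc.
    mlflow.log_artifact("artifacts/model.pt")
```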
Model Serving Architecture
Serving Patterns
| Pattern | Latency | Throughput | Use Case |
|---------|---------|------------|----------|
| Online (REST/gRPC) | Low (<100ms) | Medium | Real-time predictions |
| Batch | High (hours) | Very high | Bulk scoring |
| Streaming | Medium | High | Event-driven predictions |
| Embedded | Very low | Varies | Edge/mobile inference |
Online Serving Architecture
┌─────────────────────────────────────────────────────────────────────┐
│ MODEL SERVING SYSTEM │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ │
│ │ Clients │ │
│ └──────┬───────┘ │
│ │ │
│ ▼ │
│ ┌──────────────┐ │
│ │ Load Balancer│ │
│ └──────┬───────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ API Gateway │ │
│ │ - Authentication - Rate limiting - Request validation │ │
│ └──────────────────────────────┬───────────────────────────────┘ │
│ │ │
│ ┌───────────────────────┼───────────────────────┐ │
│ ▼ ▼ ▼ │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ Model A │ │ Model B │ │ Model C │ │
│ │ (v1.2) │ │ (v2.0) │ │ (v1.0) │ │
│ └────────────┘ └────────────┘ └────────────┘ │
│ │ │ │ │
│ └───────────────────────┼───────────────────────┘ │
│ ▼ │
│ ┌────────────────┐ │
│ │ Feature Store │ │
│ │ (Online) │ │
│ └────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
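A skeletal online-serving endpoint, assuming FastAPI, a Redis online feature store keyed as in the earlier sketch, and a stand-in model object; all names and the feature schema are illustrative:
```python
# Sketch: an online prediction endpoint that joins request data with
# online-store features before scoring. Names are illustrative, and
# DummyModel stands in for a model loaded from the registry.
import redis
from fastapi import FastAPI
from pydantic import BaseModel

class DummyModel:
    def predict(self, rows):
        return [0.5 for _ in rows]

app = FastAPI()
online_store = redis.Redis(host="localhost", port=6379)
model = DummyModel()  # in production: load once at startup from the model registry

class PredictRequest(BaseModel):
    user_id: int
    item_id: int

@app.post("/predict")
def predict(req: PredictRequest) -> dict:
    # Point lookup of precomputed features from the online store
    raw = online_store.hgetall(f"features:user:{req.user_id}")
    features = {k.decode(): float(v) for k, v in raw.items()}

    score = model.predict([[
        features.get("purchase_count_30d", 0.0),
        features.get("avg_order_value", 0.0),
    ]])[0]
    return {"user_id": req.user_id, "item_id": req.item_id, "score": float(score)}
```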
Latency Optimization
| Technique | Latency Impact | Trade-off |
|-----------|----------------|-----------|
| Batching | Lowers per-request compute cost | Requests wait for the batch to fill, raising latency |
| Caching | 10-100x faster | May serve stale predictions |
| Quantization | 2-4x faster | Slight accuracy loss |
| Distillation | Variable | Training overhead |
| GPU inference | 10-100x faster | Cost increase |
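
Quantization is often the cheapest of these techniques to try. A minimal sketch using PyTorch post-training dynamic quantization; the toy model is illustrative, and accuracy should be validated on held-out data before deployment:
```python
# Sketch: post-training dynamic quantization of Linear layers to int8.
# The toy model is illustrative; measure accuracy and latency before shipping.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1)).eval()

quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 128)
with torch.no_grad():
    print(model(x), quantized(x))  # outputs should be close; int8 weights cut memory and latency
```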
A/B Testing ML Models
Experiment Design
┌─────────────────────────────────────────────────────────────────────┐
│ A/B TESTING ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ │
│ │ Traffic │ │
│ └──────┬───────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────┐ │
│ │ Experiment Assignment │ ◀─────── Experiment Config │
│ │ - User bucketing │ - Allocation % │
│ │ - Feature flags │ - Target segments │
│ └──────────┬───────────┘ - Guardrails │
│ │ │
│ ┌────────┴────────┐ │
│ ▼ ▼ │
│ ┌────────┐ ┌────────┐ │
│ │Control │ │Treatment│ │
│ │Model A │ │Model B │ │
│ └────┬───┘ └────┬───┘ │
│ │ │ │
│ └────────┬───────┘ │
│ ▼ │
│ ┌────────────────┐ │
│ │ Metrics Logger │ │
│ └────────┬───────┘ │
│ ▼ │
│ ┌────────────────┐ │
│ │ Statistical │ ─────▶ Decision: Ship / Iterate / Kill │
│ │ Analysis │ │
│ └────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
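Experiment assignment should be deterministic so a given user always sees the same variant. A hash-based bucketing sketch; the experiment name and allocation percentage are illustrative:
```python
# Sketch: deterministic user bucketing for an A/B test.
# Experiment name and allocation percentage are illustrative.
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_pct: int = 50) -> str:
    """Hash user + experiment into a 0-99 bucket and map it to a variant."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "treatment" if bucket < treatment_pct else "control"

# Assignment is stable across calls, so logging and serving agree
assert assign_variant("user-42", "ranker-v2") == assign_variant("user-42", "ranker-v2")
```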
Metrics to Track
| Metric Type | Examples | Purpose |
|-------------|----------|---------|
| Model metrics | AUC, RMSE, precision/recall | Model quality |
| Business metrics | CTR, conversion, revenue | Business impact |
| Guardrail metrics | Latency, error rate, engagement | Prevent regressions |
| Segment metrics | Metrics by user segment | Detect heterogeneous effects |
Statistical Considerations
- Sample size: Run a power calculation before the experiment to fix the required sample size (see the sketch after this list)
- Duration: Account for novelty effects and time patterns
- Multiple testing: Adjust for multiple metrics (Bonferroni, FDR)
- Early stopping: Use sequential testing methods
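A back-of-envelope sample-size calculation for a proportion metric such as CTR, using the standard two-sided two-sample approximation; the baseline rate and minimum detectable effect are illustrative:
```python
# Sketch: per-variant sample size for detecting an absolute lift in a
# proportion metric. Baseline rate and minimum detectable effect are illustrative.
from scipy.stats import norm

def sample_size_per_variant(p_baseline: float, mde_abs: float,
                            alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate n per arm for a two-sided two-proportion test."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    p_avg = p_baseline + mde_abs / 2
    variance = 2 * p_avg * (1 - p_avg)
    return int((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2) + 1

# Detecting a 0.5 percentage-point lift on a 5% CTR baseline
print(sample_size_per_variant(p_baseline=0.05, mde_abs=0.005))  # roughly 31k users per arm
```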
Model Monitoring
What to Monitor
| Category | Metrics | Alert Threshold |
|----------|---------|-----------------|
| Data quality | Missing values, schema drift | >1% change |
| Feature drift | Distribution shift (PSI, KL) | PSI > 0.2 |
| Prediction drift | Output distribution shift | Depends on use case |
| Model performance | Accuracy, AUC (when labels available) | >5% degradation |
| Operational | Latency, throughput, errors | SLO violations |
Drift Detection
┌─────────────────────────────────────────────────────────────────────┐
│ DRIFT DETECTION PIPELINE │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Training Data Production Data │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Reference │ │ Current │ │
│ │ Distribution │ │ Distribution │ │
│ └──────┬───────┘ └──────┬───────┘ │
│ │ │ │
│ └──────────────┬──────────────┘ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ Statistical Test │ │
│ │ - PSI (Population Stability Index) │
│ │ - KS Test │
│ │ - Chi-squared │
│ └────────┬─────────┘ │
│ ▼ │
│ ┌──────────────────┐ │
│ │ Drift Score │ │
│ └────────┬─────────┘ │
│ │ │
│ ┌───────────┼───────────┐ │
│ ▼ ▼ ▼ │
│ No Drift Warning Critical │
│ (< 0.1) (0.1-0.2) (> 0.2) │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ Continue Investigate Retrain │
│ │
└─────────────────────────────────────────────────────────────────────┘
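A minimal PSI computation consistent with the thresholds in the diagram; the bin count, epsilon, and synthetic data are implementation choices:
```python
# Sketch: Population Stability Index between a reference (training) sample
# and a current (production) sample of one feature.
import numpy as np

def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """PSI = sum((actual% - expected%) * ln(actual% / expected%)) over bins."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    expected, _ = np.histogram(reference, bins=edges)
    actual, _ = np.histogram(current, bins=edges)  # values outside the reference range are ignored

    eps = 1e-6  # avoid division by zero / log(0) for empty bins
    expected_pct = np.clip(expected / expected.sum(), eps, None)
    actual_pct = np.clip(actual / actual.sum(), eps, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, 10_000)
current = rng.normal(1.0, 1.0, 10_000)   # mean shifted by one standard deviation
print(psi(reference, current))           # well above the 0.2 "critical" threshold
```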
Common ML System Design Patterns
Pattern 1: Recommendation System
Components needed (a toy two-stage sketch follows this list):
- Candidate Generation (retrieve 100s-1000s)
- Ranking Model (score and sort)
- Feature Store (user features, item features)
- Real-time personalization (recent behavior)
- A/B testing infrastructure
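A toy sketch of the two-stage shape: embedding dot-product retrieval followed by a ranking pass. Production systems would use an approximate nearest neighbor index and a learned ranker; everything here is illustrative:
```python
# Sketch: two-stage recommendation -- retrieve candidates by embedding
# similarity, then rank a small set with a (here, trivial) scoring function.
import numpy as np

rng = np.random.default_rng(0)
item_embeddings = rng.normal(size=(10_000, 64))   # catalog item embeddings
user_embedding = rng.normal(size=64)              # from a user tower / recent behavior

# Stage 1: candidate generation (top few hundred by dot product;
# production systems use an approximate nearest neighbor index)
scores = item_embeddings @ user_embedding
candidates = np.argsort(-scores)[:500]

# Stage 2: ranking (stand-in for a learned model over richer features)
def rank_score(item_id: int) -> float:
    return float(scores[item_id])                 # placeholder ranking model

top_k = sorted(candidates, key=rank_score, reverse=True)[:20]
```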
Pattern 2: Fraud Detection
Components needed:
- Real-time feature computation
- Low-latency model serving (<50ms)
- High recall focus (can't miss fraud)
- Explainability for compliance
- Human-in-the-loop review
- Feedback loop for labels
Pattern 3: Search Ranking
Components needed:
- Two-stage ranking (retrieval + ranking)
- Feature store for query/document features
- Low latency (<200ms end-to-end)
- Learning to rank models
- Click-through rate prediction
- A/B testing with interleaving
Estimation for ML Systems
Training Infrastructure
Training time estimation:
- Dataset size: 100M examples
- Model: Transformer (100M params)
- GPU: A100 (80GB, 312 TFLOPS)
- Batch size: 32
- Training steps: Dataset / batch = 3.1M steps
- Time per step: ~100ms
- Total time: ~86 hours single GPU
- With 8 GPUs (data parallel): ~11 hours
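The same arithmetic as a quick sanity check, assuming ideal linear scaling for data parallelism:
```python
# Sketch: reproduce the training-time arithmetic above.
examples, batch_size = 100e6, 32
steps = examples / batch_size                    # ~3.1M steps over the dataset
step_time_s = 0.100
single_gpu_hours = steps * step_time_s / 3600    # ~86.8 hours
eight_gpu_hours = single_gpu_hours / 8           # ~11 hours, assuming ideal scaling
print(round(single_gpu_hours), round(eight_gpu_hours))
```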
Serving Infrastructure
Inference estimation:
- QPS: 10,000
- Model latency: 20ms
- Batch size: 1 (real-time)
- GPU utilization: 50% (latency constraint)
- Requests per GPU/sec: 25
- GPUs needed: 10,000 / 25 = 400 GPUs
- With batching (batch 8): 100 GPUs (4x reduction)
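The serving arithmetic as a quick script; the assumption that batch size 8 roughly doubles per-batch latency is what yields the ~4x (rather than 8x) reduction:
```python
# Sketch: reproduce the serving capacity arithmetic above.
qps = 10_000
model_latency_s = 0.020
gpu_utilization = 0.50                                      # headroom kept for tail latency

requests_per_gpu = gpu_utilization / model_latency_s        # 25 requests/sec per GPU
gpus_no_batching = qps / requests_per_gpu                    # 400 GPUs

# Assumption: batch size 8 roughly doubles per-batch latency, so the
# effective throughput gain is ~4x rather than 8x.
batch_size, batch_latency_s = 8, 0.040
requests_per_gpu_batched = gpu_utilization * batch_size / batch_latency_s  # 100 requests/sec
gpus_with_batching = qps / requests_per_gpu_batched          # 100 GPUs

print(gpus_no_batching, gpus_with_batching)
```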
Related Skills
llm-serving-patterns - LLM-specific serving and optimization
rag-architecture - Retrieval-Augmented Generation patterns
vector-databases - Vector search and embeddings
ml-inference-optimization - Latency and cost optimization
estimation-techniques - Back-of-envelope calculations
quality-attributes-taxonomy - NFR definitions
Related Commands
/sd:ml-pipeline <problem> - Design ML system interactively
/sd:estimate <scenario> - Capacity calculations
Related Agents
ml-systems-designer - Design ML architectures
ml-interviewer - Mock ML system design interviews
Version History
- v1.0.0 (2025-12-26): Initial release
Last Updated
Date: 2025-12-26
Model: claude-opus-4-5-20251101