ML Inference Optimization
When to Use This Skill
Use this skill when:
- Optimizing ML inference latency
- Reducing model size for deployment
- Implementing model compression techniques
- Designing inference caching strategies
- Deploying models at the edge
- Balancing accuracy vs. latency trade-offs
Keywords: inference optimization, latency, model compression, distillation, pruning, quantization, caching, edge ML, TensorRT, ONNX, model serving, batching, hardware acceleration
Inference Optimization Overview
┌─────────────────────────────────────────────────────────────────────┐
│ Inference Optimization Stack │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Model Level │ │
│ │ Distillation │ Pruning │ Quantization │ Architecture Search │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Compiler Level │ │
│ │ Graph optimization │ Operator fusion │ Memory planning │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Runtime Level │ │
│ │ Batching │ Caching │ Async execution │ Multi-threading │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Hardware Level │ │
│ │ GPU │ TPU │ NPU │ CPU SIMD │ Custom accelerators │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
Model Compression Techniques
Technique Overview
| Technique | Size Reduction | Speed Improvement | Accuracy Impact |
|---|---|---|---|
| Quantization | 2-4x | 2-4x | Low (1-2%) |
| Pruning | 2-10x | 1-3x | Low-Medium |
| Distillation | 3-10x | 3-10x | Medium |
| Low-rank factorization | 2-5x | 1.5-3x | Low-Medium |
| Weight sharing | 10-100x | Variable | Medium-High |
Knowledge Distillation
┌─────────────────────────────────────────────────────────────────────┐
│ Knowledge Distillation │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ │
│ │ Teacher Model│ (Large, accurate, slow) │
│ │ GPT-4 │ │
│ └──────────────┘ │
│ │ │
│ ▼ Soft labels (probability distributions) │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Training Process │ │
│ │ Loss = α × CrossEntropy(student, hard_labels) │ │
│ │ + (1-α) × KL_Div(student, teacher_soft_labels) │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────┐ │
│ │Student Model │ (Small, nearly as accurate, fast) │
│ │ DistilBERT │ │
│ └──────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
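As a concrete reference for the loss above, here is a minimal distillation-loss sketch in PyTorch; `student_logits`, `teacher_logits`, `labels`, `alpha`, and the temperature `T` are illustrative placeholders, not a specific library API.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, T=2.0):
    """Blend hard-label cross-entropy with KL divergence to the teacher's
    softened distribution, matching the loss in the diagram above."""
    # Hard-label term: standard cross-entropy against ground-truth labels
    hard = F.cross_entropy(student_logits, labels)
    # Soft-label term: KL divergence between temperature-scaled distributions
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # scale by T^2 to keep gradient magnitudes comparable
    return alpha * hard + (1 - alpha) * soft
```

Higher temperatures soften the teacher distribution so the student also learns the relative ranking of wrong classes, not just the argmax.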
Distillation Types:
| Type | Description | Use Case |
|---|---|---|
| Response distillation | Match teacher outputs | General compression |
| Feature distillation | Match intermediate layers | Better transfer |
| Relation distillation | Match sample relationships | Structured data |
| Self-distillation | Model teaches itself | Regularization |
Pruning Strategies
Unstructured Pruning (Weight-level):
Before: [0.1, 0.8, 0.2, 0.9, 0.05, 0.7]
After: [0.0, 0.8, 0.0, 0.9, 0.0, 0.7] (50% sparse)
• Flexible, high sparsity possible
• Needs sparse hardware/libraries
Structured Pruning (Channel/Layer-level):
Before: ┌───┬───┬───┬───┐
│ C1│ C2│ C3│ C4│
└───┴───┴───┴───┘
After: ┌───┬───┬───┐
│ C1│ C3│ C4│ (Removed C2 entirely)
└───┴───┴───┘
• Works with standard hardware
• Lower compression ratio
Pruning Decision Criteria:
| Method | Description | Effectiveness |
|---|---|---|
| Magnitude-based | Remove smallest weights | Simple, effective |
| Gradient-based | Remove low-gradient weights | Better accuracy |
| Second-order | Use Hessian information | Best but expensive |
| Lottery ticket | Find winning subnetwork | Theoretical insight |
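As one concrete example, magnitude-based unstructured pruning is available in PyTorch's pruning utilities; a minimal sketch (the toy model and the 50% sparsity level are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))

# Zero out the 50% of weights with the smallest L1 magnitude in each Linear layer
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")  # bake the mask in so the zeros are permanent

sparsity = (model[0].weight == 0).float().mean().item()
print(f"Layer 0 sparsity: {sparsity:.0%}")  # ~50%
```

Note that the zeroed weights only translate into speedups when the runtime or hardware exploits sparsity; otherwise the benefit is mainly compressibility.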
Quantization (Detailed)
Precision Hierarchy:
FP32 (32 bits): ████████████████████████████████
FP16 (16 bits): ████████████████
BF16 (16 bits): ████████████████ (wider exponent, shorter mantissa than FP16)
INT8 (8 bits): ████████
INT4 (4 bits): ████
Binary (1 bit): █
Memory footprint and compute cost scale roughly in proportion to bit width.
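For intuition, the arithmetic behind affine INT8 quantization in a framework-free numpy sketch (the tensor values are made up): real values map to integers via a scale and zero point, and dequantization recovers them only approximately.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Affine (asymmetric) quantization of a float tensor to int8."""
    qmin, qmax = -128, 127
    scale = (x.max() - x.min()) / (qmax - qmin)       # real units per integer step
    zero_point = int(round(qmin - x.min() / scale))   # integer that represents 0.0
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale  # approximate reconstruction

x = np.array([-1.0, -0.2, 0.0, 0.7, 2.5], dtype=np.float32)
q, s, z = quantize_int8(x)
print(q, dequantize(q, s, z))  # reconstructed values are close, not exact
```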
Quantization Approaches:
| Approach | When Applied | Quality | Effort |
|---|---|---|---|
| Dynamic quantization | Runtime | Good | Low |
| Static quantization | Post-training, with calibration | Better | Medium |
| Quantization-aware training (QAT) | During training | Best | High |
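A minimal post-training dynamic quantization sketch in PyTorch (Linear weights stored as INT8, activations quantized on the fly, no calibration data needed); the toy model is a placeholder.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 2)).eval()

# Dynamic quantization: weights of the listed module types become INT8,
# activations are quantized at runtime per batch
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 768)
with torch.inference_mode():
    print(quantized(x))
```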
Compiler-Level Optimization
Graph Optimization
Original Graph:
Input → Conv → BatchNorm → ReLU → Conv → BatchNorm → ReLU → Output
Optimized Graph (Operator Fusion):
Input → FusedConvBNReLU → FusedConvBNReLU → Output
Benefits:
• Fewer kernel launches
• Better memory locality
• Reduced memory bandwidth
Common Optimizations
| Optimization | Description | Speedup |
|---|---|---|
| Operator fusion | Combine sequential ops | 1.2-2x |
| Constant folding | Pre-compute constants | 1.1-1.5x |
| Dead code elimination | Remove unused ops | Variable |
| Layout optimization | Optimize tensor memory layout | 1.1-1.3x |
| Memory planning | Optimize buffer allocation | 1.1-1.2x |
Optimization Frameworks
| Framework | Vendor | Best For |
|---|---|---|
| TensorRT | NVIDIA | NVIDIA GPUs, lowest latency |
| ONNX Runtime | Microsoft | Cross-platform, broad support |
| OpenVINO | Intel | Intel CPUs/GPUs |
| Core ML | Apple | Apple devices |
| TFLite | Google | Mobile, embedded |
| Apache TVM | Open source | Custom hardware, research |
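For example, ONNX Runtime exposes graph-level optimizations (operator fusion, constant folding) through session options. A sketch assuming a model already exported to "model.onnx"; the input shape is a placeholder.

```python
import numpy as np
import onnxruntime as ort

opts = ort.SessionOptions()
# Enable all graph-level optimizations: fusion, constant folding, etc.
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
opts.optimized_model_filepath = "model_optimized.onnx"  # optionally save the fused graph

session = ort.InferenceSession("model.onnx", opts, providers=["CPUExecutionProvider"])

# Input name and shape depend on how the model was exported; this is illustrative
x = np.random.rand(1, 3, 224, 224).astype(np.float32)
outputs = session.run(None, {session.get_inputs()[0].name: x})
```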
Runtime Optimization
Batching Strategies
No Batching:
Request 1: [Process] → Response 1 10ms
Request 2: [Process] → Response 2 10ms
Request 3: [Process] → Response 3 10ms
Total: 30ms, GPU underutilized
Dynamic Batching:
Requests 1-3: [Wait 5ms] → [Process batch] → Responses
Total: 15ms, 2x throughput
Trade-off: Latency vs. Throughput
• Larger batch: Higher throughput, higher latency
• Smaller batch: Lower latency, lower throughput
Batching Parameters:
| Parameter | Description | Trade-off |
|---|---|---|
| batch_size | Maximum batch size | Throughput vs. latency |
| max_wait_time | Wait time for the batch to fill | Latency vs. efficiency |
| min_batch_size | Minimum batch size before processing | Latency predictability |
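A simplified dynamic batcher showing how batch_size and max_wait_time interact (an asyncio sketch; `run_model` is a placeholder, and production servers such as Triton implement this logic for you).

```python
import asyncio

MAX_BATCH_SIZE = 8
MAX_WAIT_MS = 5

queue: asyncio.Queue = asyncio.Queue()

async def infer(request):
    """Public entry point: enqueue the request and await its result."""
    fut = asyncio.get_running_loop().create_future()
    await queue.put((request, fut))
    return await fut

async def batcher(run_model):
    """Collect requests until the batch is full or max_wait_time expires."""
    while True:
        batch = [await queue.get()]                      # block for the first request
        deadline = asyncio.get_running_loop().time() + MAX_WAIT_MS / 1000
        while len(batch) < MAX_BATCH_SIZE:
            timeout = deadline - asyncio.get_running_loop().time()
            if timeout <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), timeout))
            except asyncio.TimeoutError:
                break
        inputs = [req for req, _ in batch]
        outputs = run_model(inputs)                      # one batched forward pass
        for (_, fut), out in zip(batch, outputs):        # in practice, offload to an executor
            fut.set_result(out)

# Usage: asyncio.create_task(batcher(run_model)) at startup; callers await infer(request)
```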
Caching Strategies
┌─────────────────────────────────────────────────────────────────────┐
│ Inference Caching Layers │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Layer 1: Input Cache │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Cache exact inputs → Return cached outputs │ │
│ │ Hit rate: Low (inputs rarely repeat exactly) │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
│ Layer 2: Embedding Cache │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Cache computed embeddings for repeated tokens/entities │ │
│ │ Hit rate: Medium (common tokens repeat) │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
│ Layer 3: KV Cache (for transformers) │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Cache key-value pairs for attention │ │
│ │ Hit rate: High (reuse across tokens in sequence) │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
│ Layer 4: Result Cache │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Cache semantic equivalents (fuzzy matching) │ │
│ │ Hit rate: Variable (depends on query distribution) │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
Semantic Caching for LLMs:
Query: "What's the capital of France?"
↓
Hash + Embed query
↓
Search cache (similarity > threshold)
↓
├── Hit: Return cached response
└── Miss: Generate → Cache → Return
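A hedged sketch of the lookup flow above; `embed()` stands in for any sentence-embedding model and the 0.95 similarity threshold is illustrative.

```python
import numpy as np

class SemanticCache:
    def __init__(self, embed, threshold=0.95):
        self.embed = embed            # any function: text -> 1-D numpy vector
        self.threshold = threshold
        self.keys, self.values = [], []

    def lookup(self, query):
        """Return a cached response if a stored query is similar enough."""
        if not self.keys:
            return None
        q = self.embed(query)
        mat = np.stack(self.keys)
        sims = mat @ q / (np.linalg.norm(mat, axis=1) * np.linalg.norm(q) + 1e-9)
        best = int(np.argmax(sims))
        return self.values[best] if sims[best] >= self.threshold else None

    def store(self, query, response):
        self.keys.append(self.embed(query))
        self.values.append(response)

def answer(cache, generate, query):
    cached = cache.lookup(query)
    if cached is not None:            # hit: skip the expensive model call
        return cached
    response = generate(query)        # miss: generate, then cache for next time
    cache.store(query, response)
    return response
```

The threshold controls the precision/recall trade-off of the cache: too low and stale or wrong answers leak through, too high and the hit rate collapses.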
Async and Parallel Execution
Sequential:
┌─────┐ ┌─────┐ ┌─────┐
│Prep │→│Model│→│Post │ Total: 30ms
│10ms │ │15ms │ │5ms │
└─────┘ └─────┘ └─────┘
Pipelined:
Request 1: │Prep│Model│Post│
Request 2: │Prep│Model│Post│
Request 3: │Prep│Model│Post│
Throughput: 3x higher
Latency per request: Same
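A minimal pipelined executor: each stage runs in its own thread, so preprocessing of request N+1 overlaps with model execution of request N. The `preprocess`, `run_model`, and `postprocess` functions are placeholders you would supply.

```python
import queue
import threading

def stage(fn, inbox, outbox):
    """Run one pipeline stage: consume from inbox, produce to outbox."""
    while True:
        item = inbox.get()
        if item is None:              # sentinel: propagate shutdown downstream
            outbox.put(None)
            return
        outbox.put(fn(item))

def run_pipeline(requests, preprocess, run_model, postprocess):
    q_in, q_mid, q_out, q_done = (queue.Queue() for _ in range(4))
    stages = [(preprocess, q_in, q_mid), (run_model, q_mid, q_out), (postprocess, q_out, q_done)]
    threads = [threading.Thread(target=stage, args=s, daemon=True) for s in stages]
    for t in threads:
        t.start()
    for r in requests:                # feed all requests; stages overlap in time
        q_in.put(r)
    q_in.put(None)
    results = []
    while (item := q_done.get()) is not None:
        results.append(item)
    return results
```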
Hardware Acceleration
Hardware Comparison
| Hardware | Strengths | Limitations | Best For |
|---|---|---|---|
| GPU (NVIDIA) | High parallelism, mature ecosystem | Power, cost | Training, large-batch inference |
| TPU (Google) | Matrix ops, cloud integration | Vendor lock-in | Google Cloud workloads |
| NPU (Apple/Qualcomm) | Power efficient, on-device | Limited model support | Mobile, edge |
| CPU | Flexible, widely available | Slower for ML | Low-batch, CPU-bound workloads |
| FPGA | Customizable, low latency | Development complexity | Specialized workloads |
GPU Optimization
| Optimization | Description | Impact |
|---|---|---|
| Tensor Cores | Use FP16/INT8 tensor operations | 2-8x speedup |
| CUDA graphs | Reduce kernel launch overhead | 1.5-2x for small models |
| Multi-stream | Parallel execution | Higher throughput |
| Memory pooling | Reduce allocation overhead | Lower latency variance |
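A small sketch of leaning on Tensor Cores via FP16 autocast during inference in PyTorch (assumes a CUDA device; the torchvision model is only there to have something concrete to run):

```python
import torch
import torchvision  # used only to grab a stock model for illustration

model = torchvision.models.resnet50(weights=None).eval().cuda()
x = torch.randn(8, 3, 224, 224, device="cuda")

# inference_mode disables autograd bookkeeping; autocast runs eligible ops
# in FP16 so the matrix multiplications can use Tensor Cores
with torch.inference_mode(), torch.autocast("cuda", dtype=torch.float16):
    logits = model(x)
print(logits.dtype)  # float16 for autocast-eligible outputs
```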
Edge Deployment
Edge Constraints
┌─────────────────────────────────────────────────────────────────────┐
│ Edge Deployment Constraints │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Resource Constraints: │
│ ├── Memory: 1-4 GB (vs. 64+ GB cloud) │
│ ├── Compute: 1-10 TOPS (vs. 100+ TFLOPS cloud) │
│ ├── Power: 5-15W (vs. 300W+ cloud) │
│ └── Storage: 16-128 GB (vs. TB cloud) │
│ │
│ Operational Constraints: │
│ ├── No network (offline operation) │
│ ├── Variable ambient conditions │
│ ├── Infrequent updates │
│ └── Long deployment lifetime │
│ │
└─────────────────────────────────────────────────────────────────────┘
Edge Optimization Strategies
| Strategy | Description | Use When |
|---|---|---|
| Model selection | Use edge-native models (MobileNet, EfficientNet) | Their accuracy is acceptable |
| Aggressive quantization | INT8 or lower | Memory/power constrained |
| On-device distillation | Distill to a tiny model | Extreme constraints |
| Split inference | Edge preprocessing, cloud inference | Network available |
| Result caching | Cache results locally | Repeated queries |
Edge ML Frameworks
| Framework | Platform | Features |
|---|---|---|
| TensorFlow Lite | Android, iOS, embedded | Quantization, delegates |
| Core ML | iOS, macOS | Neural Engine optimization |
| ONNX Runtime Mobile | Cross-platform | Broad model support |
| PyTorch Mobile | Android, iOS | Familiar API |
| TensorRT | NVIDIA Jetson | Maximum performance |
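As one concrete edge path, TensorFlow Lite supports post-training full-INT8 quantization with a small calibration dataset; a sketch where the saved-model path and representative data are placeholders you would supply.

```python
import tensorflow as tf

def representative_data_gen():
    # Yield ~100 calibration samples shaped like real model inputs
    for _ in range(100):
        yield [tf.random.uniform([1, 224, 224, 3])]

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8     # fully integer I/O suits NPUs/DSPs
converter.inference_output_type = tf.int8

with open("model_int8.tflite", "wb") as f:
    f.write(converter.convert())
```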
Latency Profiling
Profiling Methodology
┌─────────────────────────────────────────────────────────────────────┐
│ Latency Breakdown Analysis │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ 1. Data Loading: ████████░░░░░░░░░░ 15% │
│ 2. Preprocessing: ██████░░░░░░░░░░░░ 10% │
│ 3. Model Inference: ████████████████░░ 60% │
│ 4. Postprocessing: ████░░░░░░░░░░░░░░ 8% │
│ 5. Response Serialization:███░░░░░░░░░░░░░░░ 7% │
│ │
│ Target: Model inference (60% = biggest optimization opportunity) │
│ │
└─────────────────────────────────────────────────────────────────────┘
Profiling Tools
| Tool | Use For |
|---|---|
| PyTorch Profiler | PyTorch model profiling |
| TensorBoard | TensorFlow visualization |
| NVIDIA Nsight | GPU profiling |
| Chrome Tracing | General timeline visualization |
| perf | CPU profiling |
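A minimal PyTorch Profiler invocation to get the per-operator breakdown behind a latency chart like the one above; the toy model and input are placeholders.

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Sequential(
    torch.nn.Linear(512, 512), torch.nn.ReLU(), torch.nn.Linear(512, 10)
).eval()
x = torch.randn(32, 512)

with torch.inference_mode():
    with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
        for _ in range(20):           # profile several iterations for stable numbers
            model(x)

# Per-operator table, sorted by self CPU time (add ProfilerActivity.CUDA on GPU)
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```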
Key Metrics
| Metric | Description | Target |
|---|---|---|
| P50 latency | Median latency | < SLA |
| P99 latency | Tail latency | < 2x P50 |
| Throughput | Requests/second | Meet demand |
| GPU utilization | Compute usage | > 80% |
| Memory bandwidth | Memory usage | < limit |
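P50/P99 can be measured with a simple timing loop before and after each optimization; a sketch with a placeholder `predict` function and illustrative warmup/run counts.

```python
import time
import numpy as np

def measure_latency(predict, inputs, warmup=10, runs=200):
    """Return P50/P99 latency in milliseconds for a single-request workload."""
    for x in inputs[:warmup]:
        predict(x)                                   # warm caches, JIT, CUDA contexts
    samples = []
    for _ in range(runs):
        x = inputs[np.random.randint(len(inputs))]
        start = time.perf_counter()
        predict(x)
        samples.append((time.perf_counter() - start) * 1000)
    return np.percentile(samples, 50), np.percentile(samples, 99)

# p50, p99 = measure_latency(model_predict, test_inputs)
# print(f"P50 {p50:.1f} ms, P99 {p99:.1f} ms")
```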
Optimization Workflow
Systematic Approach
┌─────────────────────────────────────────────────────────────────────┐
│ Optimization Workflow │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ 1. Baseline │
│ └── Measure current performance (latency, throughput, accuracy) │
│ │
│ 2. Profile │
│ └── Identify bottlenecks (model, data, system) │
│ │
│ 3. Optimize (in order of effort/impact): │
│ ├── Hardware: Use right accelerator │
│ ├── Compiler: Enable optimizations (TensorRT, ONNX) │
│ ├── Runtime: Batching, caching, async │
│ ├── Model: Quantization, pruning │
│ └── Architecture: Distillation, model change │
│ │
│ 4. Validate │
│ └── Verify accuracy maintained, latency improved │
│ │
│ 5. Deploy and Monitor │
│ └── Track real-world performance │
│ │
└─────────────────────────────────────────────────────────────────────┘
Optimization Priority Matrix
High Impact
│
Compiler Opts ────┼──── Quantization
(easy win) │ (best ROI)
│
Low Effort ──────────────┼──────────────── High Effort
│
Batching ────┼──── Distillation
(quick win) │ (major effort)
│
Low Impact
Common Patterns
Multi-Model Serving
┌─────────────────────────────────────────────────────────────────────┐
│ │
│ Request → ┌─────────┐ │
│ │ Router │ │
│ └─────────┘ │
│ │ │ │ │
│ ┌────────┘ │ └────────┐ │
│ ▼ ▼ ▼ │
│ ┌───────┐ ┌───────┐ ┌───────┐ │
│ │ Tiny │ │ Small │ │ Large │ │
│ │ <10ms │ │ <50ms │ │<500ms │ │
│ └───────┘ └───────┘ └───────┘ │
│ │
│ Routing strategies: │
│ • Complexity-based: Simple→Tiny, Complex→Large │
│ • Confidence-based: Try Tiny, escalate if low confidence │
│ • SLA-based: Route based on latency requirements │
│ │
└─────────────────────────────────────────────────────────────────────┘
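A sketch of the confidence-based routing strategy above: try the tiny model first and escalate only when its softmax confidence falls below a threshold (the models and the 0.9 threshold are placeholders).

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def route(request, tiny_model, large_model, threshold=0.9):
    """Escalate to the large model only when the tiny model is unsure."""
    logits = tiny_model(request)
    probs = softmax(logits)
    if probs.max() >= threshold:          # tiny model is confident: answer cheaply
        return int(np.argmax(probs)), "tiny"
    logits = large_model(request)         # otherwise pay for the large model
    return int(np.argmax(logits)), "large"
```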
Speculative Execution
Query: "Translate: Hello"
│
├──▶ Small model (draft): "Bonjour" (5ms)
│
└──▶ Large model (verify): Check "Bonjour" (10ms parallel)
│
├── Accept: Return immediately
└── Reject: Generate with large model
Speedup: 2-3x when drafts are often accepted
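A simplified draft-and-verify sketch matching the flow above: the small model proposes an answer, the large model only scores it, and full generation happens just on rejection. `draft_model`, `large_model`, `verify_score`, and the threshold are placeholders; real speculative decoding works token by token.

```python
def speculative_answer(query, draft_model, large_model, verify_score, accept_threshold=0.8):
    """Return the cheap draft whenever the large model agrees with it."""
    draft = draft_model(query)                          # fast, low-cost proposal
    score = verify_score(large_model, query, draft)     # e.g. large model's probability of the draft
    if score >= accept_threshold:
        return draft                                    # accept: full generation skipped
    return large_model(query)                           # reject: fall back to full generation
```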
Cascade Models
Input → ┌────────┐
│ Filter │ ← Cheap filter (reject obvious negatives)
└────────┘
│ (candidates only)
▼
┌────────┐
│ Stage 1│ ← Fast model (coarse ranking)
└────────┘
│ (top-100)
▼
┌────────┐
│ Stage 2│ ← Accurate model (fine ranking)
└────────┘
│ (top-10)
▼
Output
Benefit: 10x cheaper, similar accuracy
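The cascade above as a sketch: each stage only sees what the previous stage passed along. `filter_score`, `stage1_score`, and `stage2_score` are placeholder scoring functions of increasing cost and accuracy.

```python
def cascade_rank(query, candidates, filter_score, stage1_score, stage2_score):
    """Cheap filter -> fast coarse ranker -> accurate fine ranker."""
    # Stage 0: cheap filter rejects obvious negatives
    survivors = [c for c in candidates if filter_score(query, c) > 0.0]
    # Stage 1: fast model keeps the coarse top-100
    survivors = sorted(survivors, key=lambda c: stage1_score(query, c), reverse=True)[:100]
    # Stage 2: accurate (expensive) model produces the final top-10
    return sorted(survivors, key=lambda c: stage2_score(query, c), reverse=True)[:10]
```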
Optimization Checklist
Pre-Deployment
- Baseline latency, throughput, and accuracy measured
- Bottlenecks profiled (data loading, preprocessing, inference, postprocessing)
- Compression applied and accuracy impact validated (quantization, pruning, distillation)
- Compiler optimizations enabled (TensorRT, ONNX Runtime, OpenVINO, etc.)
Deployment
- Batching, caching, and async execution configured for the target SLA
- Hardware matched to the workload (GPU, TPU, NPU, CPU, edge accelerator)
Post-Deployment
- P50/P99 latency, throughput, and GPU utilization monitored
- Real-world accuracy tracked against the pre-optimization baseline
Related Skills
- llm-serving-patterns - LLM-specific serving optimization
- ml-system-design - End-to-end ML pipeline design
- quality-attributes-taxonomy - Performance as a quality attribute
- estimation-techniques - Capacity planning for ML systems
Version History
- v1.0.0 (2025-12-26): Initial release - ML inference optimization patterns
Last Updated
Date: 2025-12-26