LLM Serving Patterns
When to Use This Skill
Use this skill when:
- Designing LLM inference infrastructure
- Choosing between serving frameworks (vLLM, TGI, TensorRT-LLM)
- Implementing quantization for production deployment
- Optimizing batching and throughput
- Building streaming response systems
- Scaling LLM deployments cost-effectively
Keywords: LLM serving, inference, vLLM, TGI, TensorRT-LLM, quantization, INT8, INT4, FP16, batching, continuous batching, streaming, SSE, WebSocket, KV cache, PagedAttention, speculative decoding
LLM Serving Architecture Overview
┌─────────────────────────────────────────────────────────────────────┐
│ LLM Serving Stack │
├─────────────────────────────────────────────────────────────────────┤
│ Clients (API, Chat UI, Agents) │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Load Balancer / API Gateway │ │
│ │ • Rate limiting • Authentication • Request routing │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Inference Server │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │ │
│ │ │ Request │ │ Batching │ │ KV Cache │ │ │
│ │ │ Queue │──▶│ Engine │──▶│ Management │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────────────┘ │ │
│ │ │ │ │ │
│ │ ▼ ▼ │ │
│ │ ┌─────────────────────────────────────────────────────┐ │ │
│ │ │ Model Execution Engine │ │ │
│ │ │ • Tensor operations • Attention • Token sampling │ │ │
│ │ └─────────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ GPU/TPU Cluster │ │
│ │ • Model sharding • Tensor parallelism • Pipeline parallel │ │
│ └─────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
Serving Framework Comparison
| Framework | Strengths | Best For | Considerations |
|-----------|-----------|----------|-----------------|
| vLLM | PagedAttention, high throughput, continuous batching | General LLM serving, high concurrency | Python-native, active community |
| TGI (Text Generation Inference) | Production-ready, Hugging Face integration | Enterprise deployment, HF models | Rust backend, Docker-first |
| TensorRT-LLM | NVIDIA optimization, lowest latency | NVIDIA GPUs, latency-critical | NVIDIA-only, complex setup |
| Triton Inference Server | Multi-model, multi-framework | Heterogeneous model serving | Enterprise complexity |
| Ollama | Simple local deployment | Development, edge deployment | Limited scaling features |
| llama.cpp | CPU inference, quantization | Resource-constrained, edge | C++ integration required |
Framework Selection Decision Tree
Need lowest latency on NVIDIA GPUs?
├── Yes → TensorRT-LLM
└── No
└── Need high throughput with many concurrent users?
├── Yes → vLLM (PagedAttention)
└── No
└── Need enterprise features + HF integration?
├── Yes → TGI
└── No
└── Simple local/edge deployment?
├── Yes → Ollama or llama.cpp
└── No → vLLM (general purpose)
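If the decision tree lands on vLLM, a minimal offline-inference sketch looks like the following. This is a hedged example: the model id is a placeholder, and it assumes vLLM is installed with its standard Python API (`LLM`, `SamplingParams`).

```python
# Minimal vLLM sketch: offline batch generation. Continuous batching and
# KV cache management are handled internally by the engine.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model id
sampling = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=128)

prompts = [
    "Explain PagedAttention in one sentence.",
    "List two benefits of continuous batching.",
]

# vLLM batches these requests internally and returns one output per prompt.
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```

For online serving, vLLM also ships an OpenAI-compatible HTTP server, which pairs naturally with the streaming patterns later in this document.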
Quantization Techniques
Precision Levels
| Precision | Bits | Memory Reduction | Quality Impact | Use Case |
|-----------|------|------------------|----------------|----------|
| FP32 | 32 | Baseline | None | Training, reference |
| FP16/BF16 | 16 | 2x | Minimal | Standard serving |
| INT8 | 8 | 4x | Low | Production serving |
| INT4 | 4 | 8x | Moderate | Resource-constrained |
| INT2 | 2 | 16x | Significant | Experimental |
Quantization Methods
| Method | Description | Quality | Speed |
|--------|-------------|---------|-------|
| PTQ (Post-Training Quantization) | Quantize after training, no retraining | Good | Fast to apply |
| QAT (Quantization-Aware Training) | Simulate quantization during training | Better | Requires training |
| GPTQ | One-shot weight quantization | Very good | Moderate |
| AWQ (Activation-aware Weight Quantization) | Preserves salient weights | Excellent | Moderate |
| GGUF/GGML | llama.cpp format, CPU-optimized | Good | Very fast inference |
| SmoothQuant | Migrates quantization difficulty from activations to weights | Excellent | Moderate |
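As an illustration, vLLM can load pre-quantized checkpoints directly. The sketch below is an assumption-laden example: the repository name is a placeholder for any AWQ-quantized model, and the `quantization` argument's exact behavior depends on the installed vLLM version and checkpoint format.

```python
# Hedged sketch: serving an AWQ-quantized checkpoint with vLLM.
# The model id is a placeholder; verify quantization support for your version.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-13B-chat-AWQ",  # assumed pre-quantized repo
    quantization="awq",                     # must match the checkpoint's method
    dtype="float16",
)

out = llm.generate(["Summarize AWQ in one line."], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```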
Quantization Selection
Quality vs. Efficiency Trade-off:
Quality ────────────────────────────────────────────▶ Efficiency
│ │
│ FP32 FP16 INT8+AWQ INT8+GPTQ INT4 INT2 │
│ ○───────○────────○──────────○──────────○──────○ │
│ │ │ │ │ │ │ │
│ Best Great Good Good Fair Poor │
│ │
Batching Strategies
Static Batching
Request 1: [tokens: 100] ─┐
Request 2: [tokens: 50] ─┼──▶ [Batch: pad to 100] ──▶ Process ──▶ All complete
Request 3: [tokens: 80] ─┘
Problem: the batch returns only when the longest request finishes, so short requests sit padded and idle while new requests queue behind the batch (head-of-line blocking)
Continuous Batching (Preferred)
Time ──────────────────────────────────────────────────────────▶
Req 1: [████████████████████████████████] ──▶ Complete
Req 2: [████████████] ──▶ Complete ──▶ Req 4 starts [████████████████]
Req 3: [████████████████████] ──▶ Complete ──▶ Req 5 starts [████████]
• New requests join batch as others complete
• No padding waste
• Optimal GPU utilization
Batching Parameters
| Parameter | Description | Trade-off |
|-----------|-------------|-----------|
| max_batch_size | Maximum concurrent requests | Memory vs. throughput |
| max_waiting_tokens | Tokens to wait before forcing a batch | Latency vs. throughput |
| max_num_seqs | Maximum sequences in a batch | Memory vs. concurrency |
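The parameter names below follow vLLM's engine arguments (`max_num_seqs`, `max_num_batched_tokens`, `gpu_memory_utilization`); the values are illustrative starting points to tune against your own latency and memory budget, not recommendations.

```python
# Illustrative continuous-batching configuration for vLLM.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model id
    max_num_seqs=256,              # cap on sequences in the running batch
    max_num_batched_tokens=8192,   # cap on tokens processed per engine step
    gpu_memory_utilization=0.90,   # fraction of GPU memory for weights + KV cache
)
```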
KV Cache Management
The KV Cache Problem
Attention: softmax(QK^T / √d_k) × V
For each new token generated:
• The token's query must attend to the keys and values of ALL previous tokens
• Caching K and V avoids recomputing them, but the cache grows with sequence length
• Memory: O(batch_size × seq_len × num_layers × hidden_dim)
Example (70B model, 4K context):
• KV cache per request: ~8GB
• 10 concurrent requests: ~80GB GPU memory
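The formula above translates into a back-of-the-envelope calculator. The layer count, head dimensions, and GQA split below are illustrative assumptions, not tied to a specific checkpoint; they land in the same order of magnitude as the ~8GB figure above, and show how much grouped-query attention changes the picture.

```python
def kv_cache_bytes(batch_size: int, seq_len: int, num_layers: int,
                   num_kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """KV cache size: K and V per layer, per token, per sequence (FP16 = 2 bytes)."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return batch_size * seq_len * per_token

# Illustrative 70B-class config without GQA (64 KV heads x 128 dims = 8192 hidden):
full_mha = kv_cache_bytes(batch_size=1, seq_len=4096,
                          num_layers=80, num_kv_heads=64, head_dim=128)
# Same shape with grouped-query attention (8 KV heads):
gqa = kv_cache_bytes(batch_size=1, seq_len=4096,
                     num_layers=80, num_kv_heads=8, head_dim=128)
print(f"full MHA: {full_mha / 1e9:.1f} GB, GQA: {gqa / 1e9:.1f} GB")
```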
PagedAttention (vLLM Innovation)
Traditional KV Cache:
┌──────────────────────────────────────────┐
│ Request 1 KV Cache (contiguous, fixed) │ ← Wastes memory
├──────────────────────────────────────────┤
│ Request 2 KV Cache (contiguous, fixed) │
├──────────────────────────────────────────┤
│ FRAGMENTED/WASTED SPACE │
└──────────────────────────────────────────┘
PagedAttention:
┌────┬────┬────┬────┬────┬────┬────┬────┐
│ R1 │ R2 │ R1 │ R3 │ R2 │ R1 │ R3 │ R2 │ ← Pages allocated on demand
└────┴────┴────┴────┴────┴────┴────┴────┘
• Non-contiguous memory allocation
• Near-zero memory waste
• 2-4x higher throughput
KV Cache Optimization Strategies
| Strategy | Description | Memory Savings |
|----------|-------------|----------------|
| PagedAttention | Virtual-memory-style paging for the KV cache | ~50% reduction |
| Prefix Caching | Reuse KV cache for common prefixes | Up to 100% of a shared system prompt |
| Quantized KV Cache | INT8/FP8 for K/V values | 50-75% reduction |
| Sliding Window | Limit attention to a fixed window | Bounded by window size |
| MQA/GQA | Multi-query / grouped-query attention (fewer KV heads) | Architecture-dependent |
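Prefix caching is usually a single switch on the engine. The sketch below uses vLLM's `enable_prefix_caching` flag (model id is a placeholder; flag availability depends on the vLLM version).

```python
# Sketch: requests sharing a long system prompt reuse its KV cache blocks.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model id
    enable_prefix_caching=True,
)
```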
Streaming Response Patterns
Server-Sent Events (SSE)
Client Server
│ │
│──── POST /v1/chat/completions ─────▶│
│ (stream: true) │
│ │
│◀──── HTTP 200 OK ───────────────────│
│ Content-Type: text/event-stream│
│ │
│◀──── data: {"token": "Hello"} ──────│
│◀──── data: {"token": " world"} ─────│
│◀──── data: {"token": "!"} ──────────│
│◀──── data: [DONE] ──────────────────│
│ │
SSE Benefits:
- HTTP/1.1 compatible
- Auto-reconnection support
- Simple to implement
- Wide client support
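A minimal server-side SSE sketch using FastAPI's `StreamingResponse`. The token generator here is a stand-in for the inference engine's async stream; endpoint path and payload shape mirror the diagram above but are otherwise assumptions.

```python
# Minimal SSE endpoint sketch (FastAPI). Token source is faked; in practice
# it would be the inference engine's async token stream.
import asyncio
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def fake_token_stream(prompt: str):
    for token in ["Hello", " world", "!"]:
        await asyncio.sleep(0.05)  # simulate per-token generation latency
        yield token

@app.post("/v1/chat/completions")
async def chat(body: dict):
    async def event_stream():
        async for token in fake_token_stream(body.get("prompt", "")):
            yield f"data: {json.dumps({'token': token})}\n\n"
        yield "data: [DONE]\n\n"
    return StreamingResponse(event_stream(), media_type="text/event-stream")
```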
WebSocket Streaming
Client Server
│ │
│──── WebSocket Upgrade ─────────────▶│
│◀──── 101 Switching Protocols ───────│
│ │
│──── {"prompt": "Hello"} ───────────▶│
│ │
│◀──── {"token": "Hi"} ───────────────│
│◀──── {"token": " there"} ───────────│
│◀──── {"token": "!"} ────────────────│
│◀──── {"done": true} ────────────────│
│ │
WebSocket Benefits:
- Bidirectional communication
- Lower latency
- Better for chat applications
- Connection persistence
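The WebSocket equivalent, again as a sketch with a faked token source, mirroring the message shapes in the diagram above.

```python
# Minimal WebSocket streaming sketch (FastAPI).
from fastapi import FastAPI, WebSocket

app = FastAPI()

@app.websocket("/ws/generate")
async def generate(ws: WebSocket):
    await ws.accept()
    request = await ws.receive_json()      # e.g. {"prompt": "Hello"}
    for token in ["Hi", " there", "!"]:    # stand-in for engine output
        await ws.send_json({"token": token})
    await ws.send_json({"done": True})
    await ws.close()
```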
Streaming Implementation Considerations
| Aspect | SSE | WebSocket |
|--------|-----|-----------|
| Reconnection | Built-in | Manual |
| Scalability | Per-request | Connection pool |
| Load Balancing | Standard HTTP | Sticky sessions |
| Firewall/Proxy | Usually works | May need config |
| Best For | One-way streaming | Interactive chat |
Speculative Decoding
Concept
Standard Decoding:
Large Model: [T1] → [T2] → [T3] → [T4] → [T5]
10ms 10ms 10ms 10ms 10ms = 50ms total
Speculative Decoding:
Draft Model: [T1, T2, T3, T4, T5] (parallel, 5ms)
│
▼
Large Model: [Verify T1-T5 in one pass] (15ms)
Accept: T1, T2, T3 ✓ Reject: T4, T5 ✗
│
▼
[Generate T4, T5 correctly]
Total: ~25ms (2x speedup if 60% acceptance)
Speculative Decoding Trade-offs
| Factor | Impact |
|--------|--------|
| Draft model quality | Higher acceptance rate = more speedup |
| Draft model size | Larger = better drafts, but slower drafting |
| Speculation depth | More draft tokens per round = higher risk/reward |
| Verification cost | One verification pass must cost less than generating the accepted tokens sequentially |
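A rough estimator for these trade-offs, using the standard expected-tokens-per-round expression for a per-token acceptance probability α and draft length k (accept until the first rejection, plus one corrected token from the verification pass). The timing constants are placeholders, so the result will differ from the simplified diagram above.

```python
def speculative_speedup(t_target_ms: float, t_draft_ms: float,
                        t_verify_ms: float, k: int, alpha: float) -> float:
    """Expected speedup over standard one-token-per-step decoding.

    t_target_ms: target model time per standard decoding step
    t_draft_ms:  draft model time to propose k tokens
    t_verify_ms: target model time to verify the k draft tokens in one pass
    alpha:       probability that each draft token is accepted
    """
    tokens_per_round = (1 - alpha ** (k + 1)) / (1 - alpha)
    tokens_per_ms = tokens_per_round / (t_draft_ms + t_verify_ms)
    return tokens_per_ms * t_target_ms

# Placeholder numbers, not benchmarks: ~1.8x at 80% acceptance.
print(f"{speculative_speedup(10, 5, 15, k=5, alpha=0.8):.2f}x")
```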
Scaling Strategies
Horizontal Scaling
┌─────────────────────────────────────────────────────────┐
│ Load Balancer │
│ (Round-robin, Least-connections) │
└─────────────────────────────────────────────────────────┘
│ │ │
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│ vLLM │ │ vLLM │ │ vLLM │
│ Node 1 │ │ Node 2 │ │ Node 3 │
│ (GPU×4) │ │ (GPU×4) │ │ (GPU×4) │
└─────────┘ └─────────┘ └─────────┘
Model Parallelism
| Strategy | Description | Use Case |
|----------|-------------|----------|
| Tensor Parallelism | Split each layer's weights across GPUs | Single large model |
| Pipeline Parallelism | Different layers on different GPUs | Very large models |
| Data Parallelism | Same model replicated, different batches | High throughput |
Tensor Parallelism (TP=4):
┌─────────────────────────────────────────┐
│ Layer N │
│ GPU0 │ GPU1 │ GPU2 │ GPU3 │
│ 25% │ 25% │ 25% │ 25% │
└─────────────────────────────────────────┘
Pipeline Parallelism (PP=4):
GPU0: Layers 0-7
GPU1: Layers 8-15
GPU2: Layers 16-23
GPU3: Layers 24-31
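In practice, tensor parallelism is usually a single engine argument. The sketch below uses vLLM's `tensor_parallel_size` with a placeholder model id; pipeline parallelism is configured analogously where the framework supports it.

```python
# Sketch: sharding one large model across 4 GPUs with tensor parallelism.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder large model
    tensor_parallel_size=4,  # split each layer's weights across 4 GPUs
)
```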
Latency Optimization Checklist
Pre-deployment
- Choose a serving framework using the decision tree above
- Quantize weights (and optionally the KV cache) to the lowest precision that meets quality targets
Runtime
- Enable continuous batching and tune max_num_seqs / waiting-token limits
- Enable prefix caching for shared system prompts
- Evaluate speculative decoding for latency-critical paths
Infrastructure
- Scale horizontally behind a load balancer; add tensor/pipeline parallelism only when a model cannot fit on one node
- Route simple requests to smaller models and cache repeated responses
Cost Optimization
Cost Drivers
| Factor | Impact | Optimization |
|--------|--------|--------------|
| GPU hours | Highest | Quantization, batching |
| Memory | High | PagedAttention, KV cache optimization |
| Network | Medium | Response compression, edge deployment |
| Storage | Low | Model deduplication |
Cost Estimation Formula
Monthly Cost =
(Requests/month) × (Avg tokens/request) × (GPU-seconds/token) × ($/GPU-hour)
─────────────────────────────────────────────────────────────────────────────
3600
Example:
• 10M requests/month
• 500 tokens average
• 0.001 GPU-seconds/token (optimized)
• $2/GPU-hour
Cost = (10M × 500 × 0.001 × 2) / 3600 = $2,778/month
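The formula translates directly into a small helper; the example call reproduces the worked numbers above.

```python
def monthly_cost_usd(requests_per_month: float,
                     avg_tokens_per_request: float,
                     gpu_seconds_per_token: float,
                     usd_per_gpu_hour: float) -> float:
    """Cost formula above: total GPU-seconds converted to GPU-hours, then priced."""
    gpu_seconds = requests_per_month * avg_tokens_per_request * gpu_seconds_per_token
    return gpu_seconds / 3600 * usd_per_gpu_hour

# 10M requests, 500 tokens avg, 0.001 GPU-s/token, $2/GPU-hour:
print(f"${monthly_cost_usd(10e6, 500, 0.001, 2.0):,.0f}/month")  # -> $2,778/month
```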
Common Patterns
Multi-model Routing
┌─────────────────────────────────────────────────────────┐
│ Router │
│ • Classify request complexity │
│ • Route to appropriate model │
└─────────────────────────────────────────────────────────┘
│ │ │
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Small │ │ Medium │ │ Large │
│ Model │ │ Model │ │ Model │
│ (7B) │ │ (13B) │ │ (70B) │
│ Fast │ │ Balanced│ │ Quality │
└─────────┘ └─────────┘ └─────────┘
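A deliberately simple routing sketch: classify each request with a cheap heuristic (here, prompt length; in practice a classifier or a small LLM) and dispatch to one of several endpoints. The thresholds, tier names, and endpoint URLs are hypothetical placeholders.

```python
# Hypothetical complexity-based router.
MODEL_TIERS = {
    "small":  "http://llm-7b.internal/v1",    # fast, cheap
    "medium": "http://llm-13b.internal/v1",   # balanced
    "large":  "http://llm-70b.internal/v1",   # highest quality
}

def classify(prompt: str) -> str:
    """Stand-in classifier: route by rough prompt length/complexity."""
    words = len(prompt.split())
    if words < 50 and "explain" not in prompt.lower():
        return "small"
    if words < 300:
        return "medium"
    return "large"

def route(prompt: str) -> str:
    return MODEL_TIERS[classify(prompt)]

print(route("Translate 'hello' to French"))  # -> small-tier endpoint
```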
Caching Strategies
| Cache Type | What to Cache | TTL |
|------------|---------------|-----|
| Prompt cache | Common system prompts | Long |
| KV cache | Prefix tokens | Session |
| Response cache | Exact query matches | Varies |
| Embedding cache | Document embeddings | Long |
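A minimal exact-match response cache with TTL, keyed on a hash of the normalized prompt. This is an illustrative sketch; production systems typically add semantic (embedding-based) matching and a shared store such as Redis.

```python
# Minimal exact-match response cache with TTL (illustrative only).
import hashlib
import time

class ResponseCache:
    def __init__(self, ttl_seconds: float = 3600):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, str]] = {}

    def _key(self, prompt: str) -> str:
        return hashlib.sha256(prompt.strip().lower().encode()).hexdigest()

    def get(self, prompt: str) -> str | None:
        entry = self._store.get(self._key(prompt))
        if entry and time.time() - entry[0] < self.ttl:
            return entry[1]
        return None

    def put(self, prompt: str, response: str) -> None:
        self._store[self._key(prompt)] = (time.time(), response)

cache = ResponseCache(ttl_seconds=600)
cache.put("What is PagedAttention?", "A paged KV-cache scheme used by vLLM.")
print(cache.get("what is pagedattention?"))  # hit: same prompt after normalization
```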
Related Skills
- ml-system-design - End-to-end ML pipeline design
- rag-architecture - Retrieval-augmented generation patterns
- vector-databases - Vector search for LLM context
- ml-inference-optimization - General inference optimization
- estimation-techniques - Capacity planning for LLM systems
Version History
- v1.0.0 (2025-12-26): Initial release - LLM serving patterns for systems design interviews
Last Updated
Date: 2025-12-26