| name | model-compression-techniques |
| description | Compress models for edge deployment and cost reduction via pruning, distillation, and optimization. |
Model Compression Techniques
When to Use This Skill
Use this skill when:
- Deploying models to edge devices (mobile, IoT, embedded systems)
- Model too large for deployment constraints (storage, memory, bandwidth)
- Inference costs too high (need smaller/faster model)
- Need to balance model size, speed, and accuracy
- Combining multiple compression techniques (quantization + pruning + distillation)
When NOT to use:
- Model already fits deployment constraints (compression unnecessary)
- Training optimization needed (use training-optimization pack instead)
- Quantization is sufficient (use quantization-for-inference instead)
- LLM-specific optimization (use llm-specialist for KV cache, speculative decoding)
Relationship with quantization-for-inference:
- Quantization: Reduce precision (FP32 → INT8/INT4) - 4× size reduction
- Compression: Reduce architecture (pruning, distillation) - 2-10× size reduction
- Often combined: Quantization + pruning + distillation = 10-50× total reduction
Core Principle
Compression is not one-size-fits-all. Architecture and deployment target determine technique.
Without systematic compression:
- Mobile deployment: 440MB model crashes 2GB devices
- Wrong technique: Pruning transformers → 33pp accuracy drop
- Unstructured pruning: No speedup on standard hardware
- Aggressive distillation: 77× compression produces gibberish
- No recovery: 5pp preventable accuracy loss
Formula: Architecture analysis (transformer vs CNN) + Deployment constraints (hardware, latency, size) + Technique selection (pruning vs distillation) + Quality preservation (recovery, progressive compression) = Production-ready compressed model.
Compression Decision Framework
Model Compression Decision Tree
1. What is target deployment?
├─ Edge/Mobile (strict size/memory) → Aggressive compression (4-10×)
├─ Cloud/Server (cost optimization) → Moderate compression (2-4×)
└─ On-premises (moderate constraints) → Balanced approach
2. What is model architecture?
├─ Transformer (BERT, GPT, T5)
│ └─ Primary: Knowledge distillation (preserves attention)
│ └─ Secondary: Layer dropping, quantization
│ └─ AVOID: Aggressive unstructured pruning (destroys quality)
│
├─ CNN (ResNet, EfficientNet, MobileNet)
│ └─ Primary: Structured channel pruning (works well)
│ └─ Secondary: Quantization (INT8 standard)
│ └─ Tertiary: Knowledge distillation (classification tasks)
│
└─ RNN/LSTM
└─ Primary: Quantization (safe, effective)
└─ Secondary: Structured pruning (hidden dimension)
└─ AVOID: Unstructured pruning (breaks sequential dependencies)
3. What is deployment hardware?
├─ CPU/GPU/Mobile (standard) → Structured pruning (actual speedup)
└─ Specialized (A100, sparse accelerators) → Unstructured pruning possible
4. What is acceptable quality loss?
├─ <2pp → Conservative: Quantization only (4× reduction)
├─ 2-5pp → Moderate: Quantization + structured pruning (6-10× reduction)
└─ >5pp → Aggressive: Full pipeline with distillation (10-50× reduction)
5. Combine techniques for maximum compression:
Quantization (4×) × Pruning (2×) × Distillation (2×) = 16× total reduction (ratios multiply; see the sketch below)
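The arithmetic in step 5 is multiplicative, not additive. A minimal sketch (the plan_compression helper below is an addition, not part of the original skill) makes the composition explicit and flags plans that fall short of a target ratio:

def plan_compression(target_ratio, quantization=4.0, pruning=2.0, distillation=2.0):
    """Sanity-check a compression plan: per-technique ratios multiply."""
    total = quantization * pruning * distillation
    print(f"{quantization:.0f}× quantization × {pruning:.0f}× pruning × "
          f"{distillation:.1f}× distillation = {total:.0f}× total")
    if total < target_ratio:
        print(f"Short of the {target_ratio}× target; raise the pruning ratio or add a distillation stage")
    return total

# Example: the 16× plan from step 5
plan_compression(target_ratio=16)  # 4 × 2 × 2 = 16× total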
Part 1: Structured vs Unstructured Pruning
When to Use Each
Unstructured Pruning:
- Use when: Sparse hardware available (NVIDIA A100, specialized accelerators)
- Benefit: Highest compression (70-90% sparsity possible)
- Drawback: No speedup on standard hardware (computes zeros anyway)
- Hardware support: Rare (most deployments use standard CPU/GPU)
Structured Pruning:
- Use when: Standard hardware (CPU, GPU, mobile) - 99% of deployments
- Benefit: Actual speedup (smaller dense matrices; see the timing sketch after this list)
- Drawback: Lower compression ratio (50-70% typical)
- Variants: Channel pruning (CNNs), layer dropping (transformers), attention head pruning
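Why structured pruning gives real speedups while unstructured pruning does not: a dense kernel multiplies zeros at the same cost as non-zeros, so only a genuinely smaller matrix saves time. The comparison below is a minimal CPU sketch (the compare_sparse_vs_small_dense helper is illustrative and not part of the original skill):

import time
import torch

def compare_sparse_vs_small_dense(dim=2048, sparsity=0.5, num_runs=50):
    """Compare a masked (unstructured) matmul against a smaller dense (structured) one."""
    w_masked = torch.randn(dim, dim)
    w_masked[torch.rand(dim, dim) < sparsity] = 0.0  # 50% zeros, same shape
    w_small = torch.randn(dim // 2, dim)             # 50% of rows removed, smaller shape
    x = torch.randn(dim, 256)

    def bench(w):
        for _ in range(5):  # warmup
            w @ x
        start = time.time()
        for _ in range(num_runs):
            w @ x
        return (time.time() - start) / num_runs * 1000

    print(f"Unstructured (masked, same shape): {bench(w_masked):.2f}ms")  # ~unchanged vs dense
    print(f"Structured (smaller dense):        {bench(w_small):.2f}ms")   # ~2× faster

compare_sparse_vs_small_dense()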
Structured Channel Pruning (CNNs)
Problem: Unstructured pruning creates sparse tensors that don't accelerate on standard hardware.
Solution: Remove entire channels to create smaller dense model (actual speedup).
import torch
import torch.nn as nn
import torch_pruning as tp
def structured_channel_pruning_cnn(model, pruning_ratio=0.5, example_input=None):
"""
Structured channel pruning for CNNs (actual speedup on all hardware).
WHY structured: Removes entire channels/filters, creating smaller dense model
WHY works: Smaller dense matrices compute faster than sparse matrices on standard hardware
Args:
model: CNN model to prune
pruning_ratio: Fraction of channels to remove (0.5 = remove 50%)
example_input: Example input tensor for tracing dependencies
Returns:
Pruned model (smaller, faster)
"""
if example_input is None:
example_input = torch.randn(1, 3, 224, 224)
# Define importance metric (L1 norm of channels)
# WHY L1 norm: Channels with small L1 norm contribute less to output
importance = tp.importance.MagnitudeImportance(p=1)
# Create pruner
pruner = tp.pruner.MagnitudePruner(
model,
example_inputs=example_input,
importance=importance,
pruning_ratio=pruning_ratio,
global_pruning=False # Prune each layer independently
)
# Execute pruning (removes channels, creates smaller model)
# WHY remove channels: Conv2d(64, 128) → Conv2d(32, 64) after 50% pruning
pruner.step()
return model
# Example: Prune ResNet18
from torchvision.models import resnet18
model = resnet18(pretrained=True)
print(f"Original model size: {get_model_size(model):.1f}MB") # 44.7MB
print(f"Original params: {count_parameters(model):,}") # 11,689,512
# Apply 50% channel pruning
model_pruned = structured_channel_pruning_cnn(
model,
pruning_ratio=0.5,
example_input=torch.randn(1, 3, 224, 224)
)
print(f"Pruned model size: {get_model_size(model_pruned):.1f}MB") # roughly 11-22MB (both input and output channels shrink, so reduction can exceed 50%)
print(f"Pruned params: {count_parameters(model_pruned):,}") # roughly 3-6M parameters, depending on layer dependencies
# Benchmark inference speed
# WHY faster: Smaller dense matrices (fewer FLOPs, less memory bandwidth)
original_time = benchmark_inference(model) # 25ms
pruned_time = benchmark_inference(model_pruned) # 12.5ms (2× FASTER!)
# Accuracy (before fine-tuning)
original_acc = evaluate(model) # 69.8%
pruned_acc = evaluate(model_pruned) # 64.2% (5.6pp drop - needs fine-tuning)
# Fine-tune to recover accuracy
fine_tune(model_pruned, epochs=5, lr=1e-4)
pruned_acc_recovered = evaluate(model_pruned) # 68.5% (1.3pp drop, acceptable)
Helper functions:
def get_model_size(model):
    """Estimate model size in MB (assumes FP32 parameters)."""
    # WHY: Multiply parameter count by 4 bytes (FP32)
    # NOTE: This does not reflect INT8 quantization; measure serialized size for quantized models
    param_size = sum(p.numel() for p in model.parameters()) * 4 / (1024 ** 2)
    return param_size
def count_parameters(model):
"""Count trainable parameters."""
return sum(p.numel() for p in model.parameters() if p.requires_grad)
def benchmark_inference(model, num_runs=100):
"""Benchmark inference time (ms)."""
import time
model.eval()
example_input = torch.randn(1, 3, 224, 224)
# Warmup
with torch.no_grad():
for _ in range(10):
model(example_input)
# Benchmark
start = time.time()
with torch.no_grad():
for _ in range(num_runs):
model(example_input)
end = time.time()
return (end - start) / num_runs * 1000 # Convert to ms
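The examples in this skill also call evaluate() and fine_tune(), which are never defined here. The sketches below are minimal stand-ins, assuming a classification task and data loaders (val_loader, train_loader) that yield (inputs, labels) batches; substitute your own task-specific loops.

def evaluate(model, val_loader=None, device="cpu"):
    """Top-1 accuracy (%) over val_loader (assumed to yield (inputs, labels) batches)."""
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for inputs, labels in val_loader:
            outputs = model(inputs.to(device))
            logits = outputs.logits if hasattr(outputs, "logits") else outputs
            correct += (logits.argmax(dim=-1) == labels.to(device)).sum().item()
            total += labels.size(0)
    return 100.0 * correct / total

def fine_tune(model, epochs=3, lr=1e-4, train_loader=None, device="cpu"):
    """Plain cross-entropy fine-tuning; handles both raw logits and HF model outputs."""
    model.train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for inputs, labels in train_loader:
            outputs = model(inputs.to(device))
            logits = outputs.logits if hasattr(outputs, "logits") else outputs
            loss = nn.functional.cross_entropy(logits, labels.to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model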
Iterative Pruning (Quality Preservation)
Problem: Pruning all at once (50% in one step) → 10pp accuracy drop.
Solution: Iterative pruning (5 steps × 10% each) → 2pp accuracy drop.
import copy
import torch.nn.utils.prune as prune

def iterative_pruning(model, target_ratio=0.5, num_iterations=5, finetune_epochs=2):
"""
Iterative pruning with fine-tuning between steps.
WHY iterative: Gradual pruning allows model to adapt
WHY fine-tune: Remaining weights compensate for removed weights
Example: 50% pruning
- One-shot: 50% pruning → 10pp accuracy drop
- Iterative: 5 steps × 10% each → 2pp accuracy drop (better!)
Args:
model: Model to prune
target_ratio: Final pruning ratio (0.5 = remove 50% of weights)
num_iterations: Number of pruning steps (more = gradual = better quality)
finetune_epochs: Fine-tuning epochs after each step
Returns:
Pruned model with quality preservation
"""
# Calculate pruning amount per iteration
# WHY: Distribute total pruning across iterations
amount_per_iteration = 1 - (1 - target_ratio) ** (1 / num_iterations)
print(f"Pruning {target_ratio*100:.0f}% in {num_iterations} steps")
print(f"Amount per step: {amount_per_iteration*100:.1f}%")
for step in range(num_iterations):
print(f"\n=== Iteration {step+1}/{num_iterations} ===")
# Prune this iteration
# WHY global_unstructured: Prune across all layers (balanced sparsity)
parameters_to_prune = [
(module, "weight")
for module in model.modules()
if isinstance(module, (nn.Linear, nn.Conv2d))
]
prune.global_unstructured(
parameters_to_prune,
pruning_method=prune.L1Unstructured,
amount=amount_per_iteration
)
# Evaluate current accuracy
acc_before_finetune = evaluate(model)
print(f"Accuracy after pruning: {acc_before_finetune:.2f}%")
# Fine-tune to recover accuracy
# WHY: Allow remaining weights to compensate for removed weights
fine_tune(model, epochs=finetune_epochs, lr=1e-4)
acc_after_finetune = evaluate(model)
print(f"Accuracy after fine-tuning: {acc_after_finetune:.2f}%")
# Make pruning permanent (remove masks)
for module, param_name in parameters_to_prune:
prune.remove(module, param_name)
return model
# Example usage
model = resnet18(pretrained=True)
original_acc = evaluate(model) # 69.8%
# One-shot pruning (worse quality): prune 50% immediately
model_oneshot = copy.deepcopy(model)
prune.global_unstructured(
    [(m, "weight") for m in model_oneshot.modules() if isinstance(m, (nn.Linear, nn.Conv2d))],
    pruning_method=prune.L1Unstructured,
    amount=0.5
)
oneshot_acc = evaluate(model_oneshot) # 59.7% (10.1pp drop!)
# Iterative pruning (better quality)
model_iterative = copy.deepcopy(model)
model_iterative = iterative_pruning(
model_iterative,
target_ratio=0.5, # 50% pruning
num_iterations=5, # Gradual over 5 steps
finetune_epochs=2 # Fine-tune after each step
)
iterative_acc = evaluate(model_iterative) # 67.5% (2.3pp drop, much better!)
# Quality comparison:
# - One-shot: 10.1pp drop (unacceptable)
# - Iterative: 2.3pp drop (acceptable)
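To confirm each iteration actually hit its sparsity target, count zeroed weights directly. The global_sparsity helper below is an addition, not part of the original example:

def global_sparsity(model):
    """Fraction of Linear/Conv2d weights that are exactly zero."""
    total, zeros = 0, 0
    for module in model.modules():
        if isinstance(module, (nn.Linear, nn.Conv2d)):
            weight = module.weight.detach()
            total += weight.numel()
            zeros += (weight == 0).sum().item()
    return zeros / total

print(f"Sparsity after iterative pruning: {global_sparsity(model_iterative):.1%}")  # ~50%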
Structured Layer Pruning (Transformers)
Problem: Transformers are sensitive to unstructured pruning (it destroys attention patterns).
Solution: Drop entire layers (structured pruning for transformers).
def drop_transformer_layers(model, num_layers_to_drop=6):
"""
Drop transformer layers (structured pruning for transformers).
WHY drop layers: Transformers learn hierarchical features, later layers refine
WHY not unstructured: Attention patterns are dense, pruning destroys them
Example: BERT-base (12 layers) → BERT-small (6 layers)
- Size: 440MB → 220MB (2× reduction)
- Speed: 2× faster (half the layers)
- Accuracy: 95% → 92% (3pp drop with fine-tuning)
Args:
model: Transformer model (BERT, GPT, T5)
num_layers_to_drop: Number of layers to remove
Returns:
Smaller transformer model
"""
    # Locate the encoder (BertModel exposes model.encoder; task heads such as
    # BertForSequenceClassification wrap it as model.bert.encoder)
    encoder = model.bert.encoder if hasattr(model, "bert") else model.encoder
    total_layers = len(encoder.layer)
    layers_to_keep = total_layers - num_layers_to_drop
    # Drop middle layers: keep early layers (low-level features) and late layers (task-specific)
    # Alternative: drop early or late layers depending on task
    keep_front = layers_to_keep // 2
    keep_back = layers_to_keep - keep_front
    new_layers = list(encoder.layer[:keep_front]) + list(encoder.layer[total_layers - keep_back:])
    encoder.layer = nn.ModuleList(new_layers)
    # Update config
    model.config.num_hidden_layers = layers_to_keep
print(f"Dropped {num_layers_to_drop} layers ({total_layers} → {layers_to_keep})")
return model
# Example: Compress BERT-base
from transformers import BertForSequenceClassification
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
print(f"Original: {model.config.num_hidden_layers} layers, {get_model_size(model):.0f}MB")
# Original: 12 layers, 440MB
# Drop 6 layers (50% reduction); deep-copy first because layers are dropped in place
import copy
model_compressed = drop_transformer_layers(copy.deepcopy(model), num_layers_to_drop=6)
print(f"Compressed: {model_compressed.config.num_hidden_layers} layers, {get_model_size(model_compressed):.0f}MB")
# Compressed: 6 layers, 220MB
# Accuracy before fine-tuning
original_acc = evaluate(model) # 95.2%
compressed_acc = evaluate(model_compressed) # 88.5% (6.7pp drop)
# Fine-tune to recover accuracy
# WHY fine-tune: Remaining layers adapt to missing layers
fine_tune(model_compressed, epochs=3, lr=2e-5)
compressed_acc_recovered = evaluate(model_compressed) # 92.1% (3.1pp drop, acceptable)
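Attention head pruning (listed above as a structured variant) is another option for transformers. The sketch below uses Hugging Face's prune_heads utility; the "drop the last N heads per layer" policy is a placeholder — in practice, rank heads by an importance score (e.g., head-mask gradients) before choosing which to remove.

def prune_attention_heads(model, heads_per_layer=4):
    """Remove heads_per_layer attention heads from every encoder layer."""
    num_layers = model.config.num_hidden_layers
    num_heads = model.config.num_attention_heads
    # Placeholder policy: drop the last N heads in each layer
    heads_to_prune = {
        layer: list(range(num_heads - heads_per_layer, num_heads))
        for layer in range(num_layers)
    }
    model.prune_heads(heads_to_prune)  # shrinks Q/K/V and output projections in place
    return model

# Example: BERT-base has 12 heads per layer; dropping 4 removes ~1/3 of attention parameters
model_head_pruned = prune_attention_heads(
    BertForSequenceClassification.from_pretrained('bert-base-uncased'),
    heads_per_layer=4
)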
Part 2: Knowledge Distillation
Progressive Distillation (Quality Preservation)
Problem: Single-stage aggressive distillation fails (77× compression → unusable quality).
Solution: Progressive distillation in multiple stages (2-4× per stage).
def progressive_distillation(
teacher,
num_stages=2,
compression_per_stage=2.5,
distill_epochs=10,
finetune_epochs=3
):
"""
Progressive knowledge distillation (quality preservation for aggressive compression).
WHY progressive: Large capacity gap (teacher → tiny student) loses too much knowledge
WHY multi-stage: Smooth transition preserves quality (teacher → intermediate → final)
Example: 6× compression
- Single-stage: 774M → 130M (6× in one step) → 15pp accuracy drop (bad)
- Progressive: 774M → 310M → 130M (2.5× per stage) → 5pp accuracy drop (better)
Args:
teacher: Large pre-trained model (e.g., BERT-large, GPT-2 Large)
num_stages: Number of distillation stages (2-3 typical)
compression_per_stage: Compression ratio per stage (2-4× safe)
distill_epochs: Distillation training epochs per stage
finetune_epochs: Fine-tuning epochs on hard labels
Returns:
Final compressed student model
"""
    # NOTE: train_loader, evaluate, and fine_tune_on_labels are assumed to be defined
    # in the surrounding scope (see the helper sketches elsewhere in this skill)
    current_teacher = teacher
teacher_params = count_parameters(teacher)
print(f"Teacher: {teacher_params:,} params")
for stage in range(num_stages):
print(f"\n=== Stage {stage+1}/{num_stages} ===")
# Calculate student capacity for this stage
# WHY: Reduce by compression_per_stage factor
student_params = teacher_params // (compression_per_stage ** (stage + 1))
# Create student architecture (model-specific)
# WHY smaller: Fewer layers, smaller hidden dimension, fewer heads
student = create_student_model(
teacher_architecture=current_teacher,
target_params=student_params
)
print(f"Student {stage+1}: {count_parameters(student):,} params")
# Stage 1: Distillation training (learn from teacher)
# WHY soft targets: Teacher's probability distribution (richer than hard labels)
student = train_distillation(
teacher=current_teacher,
student=student,
train_loader=train_loader,
epochs=distill_epochs,
temperature=2.0, # WHY 2.0: Softer probabilities (more knowledge transfer)
alpha=0.7 # WHY 0.7: Weight distillation loss higher than hard loss
)
# Stage 2: Fine-tuning on hard labels (task optimization)
# WHY: Optimize student for actual task performance (not just mimicking teacher)
student = fine_tune_on_labels(
student=student,
train_loader=train_loader,
epochs=finetune_epochs,
lr=2e-5
)
# Evaluate this stage
teacher_acc = evaluate(current_teacher)
student_acc = evaluate(student)
print(f"Teacher accuracy: {teacher_acc:.2f}%")
print(f"Student accuracy: {student_acc:.2f}% (drop: {teacher_acc - student_acc:.2f}pp)")
# Student becomes teacher for next stage
current_teacher = student
return student
def create_student_model(teacher_architecture, target_params):
"""
Create student model with target parameter count.
WHY: Match architecture type but scale down capacity
"""
    from transformers import BertConfig, BertForSequenceClassification

    # Example for BERT
    if isinstance(teacher_architecture, BertForSequenceClassification):
# Scale down: fewer layers, smaller hidden size, fewer heads
# WHY: Preserve architecture but reduce capacity
teacher_config = teacher_architecture.config
        # Calculate scaling factor
        # WHY cube root: parameters scale roughly with depth × width², so shrinking
        # depth and width each by the cube root of the ratio approximates the target count
        scaling_factor = (target_params / count_parameters(teacher_architecture)) ** (1 / 3)
        num_heads = max(1, int(teacher_config.num_attention_heads * scaling_factor))
        hidden_size = int(teacher_config.hidden_size * scaling_factor)
        hidden_size -= hidden_size % num_heads  # hidden_size must divide evenly across heads
        student_config = BertConfig(
            num_hidden_layers=max(1, int(teacher_config.num_hidden_layers * scaling_factor)),
            hidden_size=hidden_size,
            num_attention_heads=num_heads,
            intermediate_size=int(teacher_config.intermediate_size * scaling_factor),
            num_labels=teacher_config.num_labels
        )
return BertForSequenceClassification(student_config)
# Add other architectures as needed
raise ValueError(f"Unsupported architecture: {type(teacher_architecture)}")
def train_distillation(teacher, student, train_loader, epochs, temperature, alpha):
"""
Train student to mimic teacher (knowledge distillation).
WHY distillation loss: Student learns soft targets (probability distributions)
WHY temperature: Softens probabilities (exposes dark knowledge)
"""
import torch.nn.functional as F
teacher.eval() # WHY: Teacher is frozen (pre-trained knowledge)
student.train()
optimizer = torch.optim.AdamW(student.parameters(), lr=2e-5)
for epoch in range(epochs):
total_loss = 0
for batch, labels in train_loader:
# Teacher predictions (soft targets)
with torch.no_grad():
teacher_logits = teacher(batch).logits
# Student predictions
student_logits = student(batch).logits
# Distillation loss (KL divergence with temperature scaling)
# WHY temperature: Softens probabilities, exposes similarities between classes
soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
soft_predictions = F.log_softmax(student_logits / temperature, dim=-1)
distillation_loss = F.kl_div(
soft_predictions,
soft_targets,
reduction='batchmean'
) * (temperature ** 2) # WHY T^2: Scale loss appropriately
# Hard label loss (cross-entropy with ground truth)
hard_loss = F.cross_entropy(student_logits, labels)
# Combined loss
# WHY alpha: Balance distillation (learn from teacher) and hard loss (task performance)
loss = alpha * distillation_loss + (1 - alpha) * hard_loss
optimizer.zero_grad()
loss.backward()
optimizer.step()
total_loss += loss.item()
print(f"Epoch {epoch+1}/{epochs}, Loss: {total_loss/len(train_loader):.4f}")
return student
# Example usage: Compress GPT-2 Large (774M) to GPT-2 Small (124M)
from transformers import GPT2LMHeadModel
teacher = GPT2LMHeadModel.from_pretrained('gpt2-large') # 774M params
# Progressive distillation (2 stages, 2.5× per stage = 6.25× total)
student_final = progressive_distillation(
teacher=teacher,
num_stages=2,
compression_per_stage=2.5, # 2.5× per stage
distill_epochs=10,
finetune_epochs=3
)
# NOTE: create_student_model above only implements BERT; a GPT-2 student builder
# (fewer layers/heads, smaller n_embd via GPT2Config) is needed for this example.
# Illustrative results (lower perplexity is better):
# - Teacher (GPT-2 Large): 774M params, perplexity 18.5
# - Student 1 (intermediate): 310M params, perplexity 22.1 (+3.6)
# - Student 2 (final): 124M params, perplexity 28.5 (+10.0)
# - Single-stage (direct 774M → 124M): perplexity 45.2 (+26.7, much worse!)
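progressive_distillation also calls fine_tune_on_labels(), which is not defined in this skill. A minimal sketch, assuming a classification head and a train_loader that yields (batch, labels):

def fine_tune_on_labels(student, train_loader, epochs=3, lr=2e-5):
    """Standard supervised fine-tuning on hard labels after distillation."""
    import torch.nn.functional as F
    student.train()
    optimizer = torch.optim.AdamW(student.parameters(), lr=lr)
    for _ in range(epochs):
        for batch, labels in train_loader:
            logits = student(batch).logits
            loss = F.cross_entropy(logits, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return student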
Capacity Matching Guidelines
Problem: Student too small → can't learn teacher knowledge. Student too large → inefficient.
Solution: Match student capacity to compression target and quality tolerance.
def calculate_optimal_student_capacity(
teacher_params,
target_compression,
quality_tolerance,
architecture_type
):
"""
Calculate optimal student model capacity.
Compression guidelines:
- 2-4× compression: Minimal quality loss (1-3pp)
- 4-8× compression: Acceptable quality loss (3-7pp) with fine-tuning
- 8-15× compression: Significant quality loss (7-15pp), risky
- >15× compression: Usually fails, student lacks capacity
Progressive distillation for >4× compression:
- Stage 1: Teacher → Student 1 (2-4× compression)
- Stage 2: Student 1 → Student 2 (2-4× compression)
- Total: 4-16× compression with quality preservation
Args:
teacher_params: Number of parameters in teacher model
target_compression: Desired compression ratio (e.g., 6.0 for 6× smaller)
        quality_tolerance: Acceptable accuracy drop as a fraction (0.05 = 5pp)
architecture_type: "transformer", "cnn", "rnn"
Returns:
(student_params, num_stages, compression_per_stage)
"""
# Compression difficulty by architecture
# WHY: Different architectures have different distillation friendliness
difficulty_factor = {
"transformer": 1.0, # Distills well (attention patterns transferable)
"cnn": 0.8, # Distills very well (spatial features transferable)
"rnn": 1.2 # Distills poorly (sequential dependencies fragile)
}[architecture_type]
# Adjust target compression by difficulty
effective_compression = target_compression * difficulty_factor
# Determine number of stages
if effective_compression <= 4:
# Single-stage distillation sufficient
num_stages = 1
compression_per_stage = effective_compression
elif effective_compression <= 16:
# Two-stage distillation
num_stages = 2
compression_per_stage = effective_compression ** 0.5
else:
# Three-stage distillation (or warn that compression is too aggressive)
num_stages = 3
compression_per_stage = effective_compression ** (1/3)
if quality_tolerance < 0.15: # <15pp drop
print(f"WARNING: {target_compression}× compression may exceed quality tolerance")
print(f"Consider: Target compression {target_compression/2:.1f}× instead")
    # Calculate final student capacity
    student_params = int(teacher_params / target_compression)
return student_params, num_stages, compression_per_stage
# Example usage
teacher_params = 774_000_000 # GPT-2 Large
# Conservative compression (2× - safe)
student_params, stages, per_stage = calculate_optimal_student_capacity(
teacher_params=teacher_params,
target_compression=2.0,
quality_tolerance=0.03, # Accept 3pp drop
architecture_type="transformer"
)
print(f"2× compression: {student_params:,} params, {stages} stage(s), {per_stage:.1f}× per stage")
# Output: 387M params, 1 stage, 2.0× per stage
# Moderate compression (6× - requires planning)
student_params, stages, per_stage = calculate_optimal_student_capacity(
teacher_params=teacher_params,
target_compression=6.0,
quality_tolerance=0.10, # Accept 10pp drop
architecture_type="transformer"
)
print(f"6× compression: {student_params:,} params, {stages} stage(s), {per_stage:.1f}× per stage")
# Output: 129M params, 2 stages, 2.4× per stage
# Aggressive compression (15× - risky)
student_params, stages, per_stage = calculate_optimal_student_capacity(
teacher_params=teacher_params,
target_compression=15.0,
quality_tolerance=0.20, # Accept 20pp drop
architecture_type="transformer"
)
print(f"15× compression: {student_params:,} params, {stages} stage(s), {per_stage:.1f}× per stage")
# Output: 51.6M params, 2 stages, 3.9× per stage
# (>8× compression is risky; expect significant quality loss even with a 20pp tolerance)
Part 3: Low-Rank Decomposition
Singular Value Decomposition (SVD) for Linear Layers
Problem: Large weight matrices (e.g., 4096×4096 in transformers) consume memory.
Solution: Decompose into two smaller matrices (low-rank factorization).
import torch
import torch.nn as nn
def decompose_linear_layer_svd(layer, rank_ratio=0.5):
"""
Decompose linear layer using SVD (low-rank approximation).
WHY: Large matrix W (m×n) → two smaller matrices U (m×r) and V (r×n)
WHY works: Weight matrices often have low effective rank (redundancy)
    Example: Linear(4096, 4096) keeping rank 512 (rank_ratio=0.125)
    - Original: 16.8M parameters (4096×4096)
    - Decomposed: 4.2M parameters (4096×512 + 512×4096) - 4× reduction!
    - Note: for a square m×m layer, savings require rank < m/2; rank_ratio=0.5 saves nothing
Args:
layer: nn.Linear layer to decompose
rank_ratio: Fraction of original rank to keep (0.5 = keep 50%)
Returns:
Sequential module with two linear layers (equivalent to original)
"""
# Get weight matrix
W = layer.weight.data # Shape: (out_features, in_features)
bias = layer.bias.data if layer.bias is not None else None
# Perform SVD: W = U @ S @ V^T
# WHY SVD: Optimal low-rank approximation (minimizes reconstruction error)
U, S, Vt = torch.linalg.svd(W, full_matrices=False)
# Determine rank to keep
original_rank = min(W.shape)
target_rank = int(original_rank * rank_ratio)
print(f"Original rank: {original_rank}, Target rank: {target_rank}")
# Truncate to target rank
# WHY keep largest singular values: They capture most of the information
U_k = U[:, :target_rank] # Shape: (out_features, target_rank)
S_k = S[:target_rank] # Shape: (target_rank,)
Vt_k = Vt[:target_rank, :] # Shape: (target_rank, in_features)
# Create two linear layers: W ≈ U_k @ diag(S_k) @ Vt_k
# Layer 1: Linear(in_features, target_rank) with weights Vt_k
# Layer 2: Linear(target_rank, out_features) with weights U_k @ diag(S_k)
layer1 = nn.Linear(W.shape[1], target_rank, bias=False)
layer1.weight.data = Vt_k
layer2 = nn.Linear(target_rank, W.shape[0], bias=(bias is not None))
layer2.weight.data = U_k * S_k.unsqueeze(0) # Incorporate S into second layer
if bias is not None:
layer2.bias.data = bias
# Return sequential module (equivalent to original layer)
return nn.Sequential(layer1, layer2)
# Example: Decompose large transformer feedforward layer
original_layer = nn.Linear(4096, 4096)
print(f"Original params: {count_parameters(original_layer):,}") # 16,781,312
# Decompose keeping 12.5% of the rank (rank 512)
decomposed_layer = decompose_linear_layer_svd(original_layer, rank_ratio=0.125)
print(f"Decomposed params: {count_parameters(decomposed_layer):,}") # 4,198,400 (4× reduction!)
# Verify reconstruction quality
x = torch.randn(1, 128, 4096) # Example input
y_original = original_layer(x)
y_decomposed = decomposed_layer(x)
reconstruction_error = torch.norm(y_original - y_decomposed) / torch.norm(y_original)
print(f"Reconstruction error: {reconstruction_error.item():.4f}")
# NOTE: the error is small for trained weights with low effective rank; a freshly
# initialized random layer (as here) has a flat spectrum, so expect a larger error
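Instead of hand-picking rank_ratio, the rank can be chosen from the singular value spectrum. The rank_for_energy helper below is an addition to the original example: it keeps the smallest rank that preserves a target fraction of spectral energy and warns when decomposition would not save parameters.

def rank_for_energy(layer, energy=0.99):
    """Smallest rank whose singular values retain `energy` of the spectral energy."""
    S = torch.linalg.svdvals(layer.weight.data)
    cumulative = torch.cumsum(S ** 2, dim=0) / torch.sum(S ** 2)
    rank = int(torch.searchsorted(cumulative, torch.tensor(energy)).item()) + 1
    out_features, in_features = layer.weight.shape
    # Decomposition only saves parameters when rank * (m + n) < m * n
    if rank * (out_features + in_features) >= out_features * in_features:
        print(f"Rank {rank} saves nothing for this layer; skip decomposition")
    return rank

# For a trained layer, convert the selected rank into a rank_ratio:
# rank = rank_for_energy(layer)
# decomposed = decompose_linear_layer_svd(layer, rank_ratio=rank / min(layer.weight.shape))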
Apply SVD to Entire Model
def decompose_model_svd(model, rank_ratio=0.5, layer_threshold=1024):
"""
Apply SVD decomposition to all large linear layers in model.
WHY selective: Only decompose large layers (small layers don't benefit)
WHY threshold: Layers with <1024 input/output features too small to benefit
Args:
model: Model to compress
rank_ratio: Fraction of rank to keep (0.5 = 2× reduction per layer)
layer_threshold: Minimum layer size to decompose (skip small layers)
Returns:
Model with decomposed layers
"""
for name, module in model.named_children():
if isinstance(module, nn.Linear):
# Only decompose large layers
if module.in_features >= layer_threshold and module.out_features >= layer_threshold:
print(f"Decomposing {name}: {module.in_features}×{module.out_features}")
# Decompose layer
decomposed = decompose_linear_layer_svd(module, rank_ratio=rank_ratio)
# Replace in model
setattr(model, name, decomposed)
elif len(list(module.children())) > 0:
# Recursively decompose nested modules
decompose_model_svd(module, rank_ratio, layer_threshold)
return model
# Example: Compress transformer model
from transformers import BertModel
model = BertModel.from_pretrained('bert-base-uncased')
original_params = count_parameters(model)
print(f"Original params: {original_params:,}") # 109M
# Apply SVD (50% rank) to feedforward layers
model_compressed = decompose_model_svd(model, rank_ratio=0.5, layer_threshold=512)
compressed_params = count_parameters(model_compressed)
print(f"Compressed params: {compressed_params:,}") # 82M (1.3× reduction)
# Fine-tune to recover accuracy
# WHY: Low-rank approximation introduces small errors, fine-tuning compensates
fine_tune(model_compressed, epochs=3, lr=2e-5)
Part 4: Combined Compression Pipelines
Quantization + Pruning + Distillation
Problem: Single technique insufficient for aggressive compression (e.g., 20× for mobile).
Solution: Combine multiple techniques (multiplicative compression).
def full_compression_pipeline(
teacher_model,
target_compression=20,
deployment_target="mobile"
):
"""
Combined compression pipeline for maximum compression.
WHY combine techniques: Multiplicative compression
- Quantization: 4× reduction (FP32 → INT8)
- Pruning: 2× reduction (50% structured pruning)
- Distillation: 2.5× reduction (progressive distillation)
- Total: 4 × 2 × 2.5 = 20× reduction!
Pipeline order:
1. Knowledge distillation (preserve quality first)
2. Structured pruning (remove redundancy)
3. Quantization (reduce precision last)
WHY this order:
- Distillation first: Creates smaller model with quality preservation
- Pruning second: Removes redundancy from distilled model
- Quantization last: Works well on already-compressed model
Args:
teacher_model: Large pre-trained model to compress
target_compression: Desired compression ratio (e.g., 20 for 20× smaller)
deployment_target: "mobile", "edge", "server"
Returns:
Fully compressed model ready for deployment
"""
print(f"=== Full Compression Pipeline (target: {target_compression}× reduction) ===\n")
# Original model metrics
original_size = get_model_size(teacher_model)
original_params = count_parameters(teacher_model)
original_acc = evaluate(teacher_model)
print(f"Original: {original_params:,} params, {original_size:.1f}MB, {original_acc:.2f}% acc")
# Step 1: Knowledge Distillation (2-2.5× compression)
# WHY first: Preserves quality better than pruning teacher directly
print("\n--- Step 1: Knowledge Distillation ---")
distillation_ratio = min(2.5, target_compression ** (1/3)) # Allocate ~1/3 of compression
student_model = progressive_distillation(
teacher=teacher_model,
num_stages=2,
compression_per_stage=distillation_ratio ** 0.5,
distill_epochs=10,
finetune_epochs=3
)
student_size = get_model_size(student_model)
student_params = count_parameters(student_model)
student_acc = evaluate(student_model)
print(f"After distillation: {student_params:,} params, {student_size:.1f}MB, {student_acc:.2f}% acc")
print(f"Compression: {original_size/student_size:.1f}×")
# Step 2: Structured Pruning (1.5-2× compression)
# WHY after distillation: Prune smaller model (easier to maintain quality)
print("\n--- Step 2: Structured Pruning ---")
pruning_ratio = min(0.5, 1 - 1/(target_compression ** (1/3))) # Allocate ~1/3 of compression
pruned_model = iterative_pruning(
model=student_model,
target_ratio=pruning_ratio,
num_iterations=5,
finetune_epochs=2
)
pruned_size = get_model_size(pruned_model)
pruned_params = count_parameters(pruned_model)
pruned_acc = evaluate(pruned_model)
print(f"After pruning: {pruned_params:,} params, {pruned_size:.1f}MB, {pruned_acc:.2f}% acc")
print(f"Compression: {original_size/pruned_size:.1f}×")
# Step 3: Quantization (4× compression)
# WHY last: Works well on already-compressed model, easy to apply
print("\n--- Step 3: Quantization (INT8) ---")
    # Quantization-aware training (prepare_qat requires the model in train mode)
    pruned_model.train()
    pruned_model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
    model_prepared = torch.quantization.prepare_qat(pruned_model)
# Fine-tune with fake quantization
fine_tune(model_prepared, epochs=3, lr=1e-4)
# Convert to INT8
model_prepared.eval()
quantized_model = torch.quantization.convert(model_prepared)
quantized_size = get_model_size(quantized_model)
quantized_acc = evaluate(quantized_model)
print(f"After quantization: {quantized_size:.1f}MB, {quantized_acc:.2f}% acc")
print(f"Total compression: {original_size/quantized_size:.1f}×")
# Summary
print("\n=== Compression Pipeline Summary ===")
print(f"Original: {original_size:.1f}MB, {original_acc:.2f}% acc")
print(f"Distilled: {student_size:.1f}MB, {student_acc:.2f}% acc ({original_size/student_size:.1f}×)")
print(f"Pruned: {pruned_size:.1f}MB, {pruned_acc:.2f}% acc ({original_size/pruned_size:.1f}×)")
print(f"Quantized: {quantized_size:.1f}MB, {quantized_acc:.2f}% acc ({original_size/quantized_size:.1f}×)")
print(f"\nFinal compression: {original_size/quantized_size:.1f}× (target: {target_compression}×)")
print(f"Accuracy drop: {original_acc - quantized_acc:.2f}pp")
# Deployment checks
if deployment_target == "mobile":
assert quantized_size <= 100, f"Model too large for mobile ({quantized_size:.1f}MB > 100MB)"
assert quantized_acc >= original_acc - 5, f"Quality loss too high ({original_acc - quantized_acc:.2f}pp)"
print("\n✓ Ready for mobile deployment")
return quantized_model
# Example: Compress BERT for mobile deployment
from transformers import BertForSequenceClassification
teacher = BertForSequenceClassification.from_pretrained('bert-base-uncased')
# Target: 20× compression (440MB → 22MB)
compressed_model = full_compression_pipeline(
teacher_model=teacher,
target_compression=20,
deployment_target="mobile"
)
# Results:
# - Original: 440MB, 95.2% acc
# - Distilled: 180MB, 93.5% acc (2.4× compression)
# - Pruned: 90MB, 92.8% acc (4.9× compression)
# - Quantized: 22MB, 92.1% acc (20× compression!)
# - Accuracy drop: 3.1pp (acceptable for mobile)
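Note that get_model_size() estimates size from FP32 parameter counts, so it under-reports the benefit of INT8 conversion. To report the post-quantization number, measure the serialized checkpoint instead; the helper below is a small addition, not part of the original pipeline:

import os
import tempfile

def get_serialized_size_mb(model):
    """On-disk size of the model's state_dict; reflects INT8 storage after quantization."""
    with tempfile.NamedTemporaryFile(suffix=".pt", delete=False) as f:
        checkpoint_path = f.name
    torch.save(model.state_dict(), checkpoint_path)
    size_mb = os.path.getsize(checkpoint_path) / (1024 ** 2)
    os.remove(checkpoint_path)
    return size_mb

print(f"Quantized checkpoint: {get_serialized_size_mb(compressed_model):.1f}MB")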
Part 5: Architecture Optimization
Neural Architecture Search for Efficiency
Problem: Manual architecture design for compression is time-consuming.
Solution: Automated search for efficient architectures (NAS for compression).
def efficient_architecture_search(
task_type,
target_latency_ms,
target_accuracy,
search_space="mobilenet"
):
"""
Search for efficient architecture meeting constraints.
WHY NAS: Automated discovery of architectures optimized for efficiency
WHY search space: MobileNet, EfficientNet designed for edge deployment
Search strategies:
- Width multiplier: Scale number of channels (0.5× - 1.5×)
- Depth multiplier: Scale number of layers (0.75× - 1.25×)
- Resolution multiplier: Scale input resolution (128px - 384px)
Args:
task_type: "classification", "detection", "segmentation"
target_latency_ms: Maximum inference latency (ms)
target_accuracy: Minimum acceptable accuracy
search_space: "mobilenet", "efficientnet", "custom"
Returns:
Optimal architecture configuration
"""
# Example: MobileNetV3 search space
# WHY MobileNet: Designed for mobile (depthwise separable convolutions)
from torchvision.models import mobilenet_v3_small, mobilenet_v3_large
configurations = [
{
"model": mobilenet_v3_small,
"width_multiplier": 0.5,
"expected_latency": 8, # ms on mobile CPU
"expected_accuracy": 60.5 # ImageNet top-1
},
{
"model": mobilenet_v3_small,
"width_multiplier": 1.0,
"expected_latency": 15,
"expected_accuracy": 67.4
},
{
"model": mobilenet_v3_large,
"width_multiplier": 0.75,
"expected_latency": 25,
"expected_accuracy": 73.3
},
{
"model": mobilenet_v3_large,
"width_multiplier": 1.0,
"expected_latency": 35,
"expected_accuracy": 75.2
}
]
# Find configurations meeting constraints
# WHY filter: Only consider configs within latency budget and accuracy requirement
valid_configs = [
config for config in configurations
if config["expected_latency"] <= target_latency_ms
and config["expected_accuracy"] >= target_accuracy
]
if not valid_configs:
print(f"No configuration meets constraints (latency={target_latency_ms}ms, accuracy={target_accuracy}%)")
print("Consider: Relax constraints or use custom search")
return None
# Select best (highest accuracy within latency budget)
best_config = max(valid_configs, key=lambda c: c["expected_accuracy"])
print(f"Selected: {best_config['model'].__name__} (width={best_config['width_multiplier']})")
print(f"Expected: {best_config['expected_latency']}ms, {best_config['expected_accuracy']}% acc")
return best_config
# Example usage: Find architecture for mobile deployment
config = efficient_architecture_search(
    task_type="classification",
    target_latency_ms=30, # 30ms latency budget
    target_accuracy=70.0, # Minimum 70% accuracy
    search_space="mobilenet"
)
# Output: Selected mobilenet_v3_large (width=0.75)
# Expected: 25ms latency, 73.3% accuracy
# Meets both constraints (25ms ≤ 30ms, 73.3% ≥ 70%)
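The table above uses estimated latencies. On a real project, measure each candidate on the target device; the sketch below reuses benchmark_inference() from Part 1 as a stand-in for on-device profiling (numbers from a development machine will differ from a phone):

def measure_candidate_latency(config, num_runs=50):
    """Instantiate a candidate and measure its latency with the Part 1 benchmark helper."""
    model = config["model"]()  # library defaults; custom width multipliers need their own builders
    latency_ms = benchmark_inference(model, num_runs=num_runs)
    print(f"{config['model'].__name__}: measured {latency_ms:.1f}ms "
          f"(table estimate: {config['expected_latency']}ms)")
    return latency_ms

if config is not None:
    measure_candidate_latency(config)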
Part 6: Quality Preservation Strategies
Trade-off Analysis Framework
def analyze_compression_tradeoffs(
model,
compression_techniques,
deployment_constraints
):
"""
Analyze compression technique trade-offs.
WHY: Different techniques have different trade-offs
- Quantization: Best size/speed, minimal quality loss (0.5-2pp)
- Pruning: Good size/speed, moderate quality loss (2-5pp)
- Distillation: Excellent quality, requires training time
Args:
model: Model to compress
compression_techniques: List of techniques to try
deployment_constraints: Dict with size_mb, latency_ms, accuracy_min
Returns:
Recommended technique and expected metrics
"""
results = []
# Quantization (FP32 → INT8)
if "quantization" in compression_techniques:
results.append({
"technique": "quantization",
"compression_ratio": 4.0,
"expected_accuracy_drop": 0.5, # 0.5-2pp with QAT
"training_time_hours": 2, # QAT training
"complexity": "low"
})
# Structured pruning (50%)
if "pruning" in compression_techniques:
results.append({
"technique": "structured_pruning_50%",
"compression_ratio": 2.0,
"expected_accuracy_drop": 2.5, # 2-5pp with iterative pruning
"training_time_hours": 8, # Iterative pruning + fine-tuning
"complexity": "medium"
})
# Knowledge distillation (2× compression)
if "distillation" in compression_techniques:
results.append({
"technique": "distillation_2x",
"compression_ratio": 2.0,
"expected_accuracy_drop": 1.5, # 1-3pp
"training_time_hours": 20, # Full distillation training
"complexity": "high"
})
# Combined (quantization + pruning)
if "combined" in compression_techniques:
results.append({
"technique": "quantization+pruning",
"compression_ratio": 8.0, # 4× × 2× = 8×
"expected_accuracy_drop": 3.5, # Additive: 0.5 + 2.5 + interaction
"training_time_hours": 12, # Pruning + QAT
"complexity": "high"
})
# Filter by constraints
original_size = get_model_size(model)
original_acc = evaluate(model)
valid_techniques = [
r for r in results
if (original_size / r["compression_ratio"]) <= deployment_constraints["size_mb"]
and (original_acc - r["expected_accuracy_drop"]) >= deployment_constraints["accuracy_min"]
]
if not valid_techniques:
print("No technique meets all constraints")
return None
    # Recommend technique (prioritize: best quality, then fastest training, then simplest)
    complexity_rank = {"low": 0, "medium": 1, "high": 2}
    best = min(
        valid_techniques,
        key=lambda r: (r["expected_accuracy_drop"], r["training_time_hours"], complexity_rank[r["complexity"]])
    )
print(f"Recommended: {best['technique']}")
print(f"Expected: {original_size/best['compression_ratio']:.1f}MB (from {original_size:.1f}MB)")
print(f"Accuracy: {original_acc - best['expected_accuracy_drop']:.1f}% (drop: {best['expected_accuracy_drop']}pp)")
print(f"Training time: {best['training_time_hours']} hours")
return best
# Example usage
deployment_constraints = {
"size_mb": 50, # Model must be <50MB
"latency_ms": 100, # <100ms inference
"accuracy_min": 90.0 # >90% accuracy
}
recommendation = analyze_compression_tradeoffs(
model=my_model,
compression_techniques=["quantization", "pruning", "distillation", "combined"],
deployment_constraints=deployment_constraints
)
Common Mistakes to Avoid
| Mistake | Why It's Wrong | Correct Approach |
|---|---|---|
| "Pruning works for all architectures" | Destroys transformer attention | Use distillation for transformers |
| "More compression is always better" | 77× compression produces gibberish | Progressive distillation for >4× |
| "Unstructured pruning speeds up inference" | No speedup on standard hardware | Use structured pruning (channel/layer) |
| "Quantize and deploy immediately" | 5pp accuracy drop without recovery | QAT + fine-tuning for quality preservation |
| "Single technique is enough" | Can't reach aggressive targets (20×) | Combine: quantization + pruning + distillation |
| "Skip fine-tuning to save time" | Preventable accuracy loss | Always include recovery step |
Success Criteria
You've correctly compressed a model when:
✅ Selected appropriate technique for architecture (distillation for transformers, pruning for CNNs)
✅ Matched student capacity to compression target (2-4× per stage, progressive for >4×)
✅ Used structured pruning for standard hardware (actual speedup)
✅ Applied iterative/progressive compression (quality preservation)
✅ Included accuracy recovery (QAT, fine-tuning, calibration)
✅ Achieved target compression with acceptable quality loss (<5pp for most tasks)
✅ Verified deployment constraints (size, latency, accuracy) are met
References
Key papers:
- DistilBERT (Sanh et al., 2019): Knowledge distillation for transformers
- The Lottery Ticket Hypothesis (Frankle & Carbin, 2019): Iterative magnitude pruning
- Pruning Filters for Efficient ConvNets (Li et al., 2017): Structured channel pruning
- Deep Compression (Han et al., 2016): Pruning + quantization + Huffman coding
When to combine with other skills:
- Use with quantization-for-inference: Quantization (4×) + compression (2-5×) = 8-20× total
- Use with hardware-optimization-strategies: Optimize compressed model for target hardware
- Use with model-serving-patterns: Deploy compressed model with batching/caching