quantization-for-inference
@tachyon-beep/skillpacks

SKILL.md

name: quantization-for-inference
description: Reduce model size and speed up inference via PTQ, QAT, and precision selection.

Quantization for Inference Skill

When to Use This Skill

Use this skill when you observe these symptoms:

Performance Symptoms:

  • Model inference too slow on CPU (e.g., >10ms when need <5ms)
  • Batch processing taking too long (low throughput)
  • Need to serve more requests per second with same hardware

Size Symptoms:

  • Model too large for edge devices (e.g., 100MB+ for mobile)
  • Want to fit more models in GPU memory
  • Memory-constrained deployment environment

Deployment Symptoms:

  • Deploying to CPU servers (quantization gives 2-4× CPU speedup)
  • Deploying to edge devices (mobile, IoT, embedded systems)
  • Cost-sensitive deployment (smaller models = lower hosting costs)

When NOT to use this skill:

  • Model already fast enough and small enough (no problem to solve)
  • Deploying exclusively on GPU with no memory constraints (modest benefit)
  • Prototyping phase where optimization is premature
  • Model so small that quantization overhead not worth it (e.g., <5MB)

Core Principle

Quantization trades precision for performance.

Quantization converts high-precision numbers (FP32: 32 bits) to low-precision integers (INT8: 8 bits or INT4: 4 bits). This provides:

  • 4-8× smaller model size (fewer bits per parameter)
  • 2-4× faster inference on CPU (INT8 operations faster than FP32)
  • Small accuracy loss (typically 0.5-1% for INT8)

Formula: Lower precision (FP32 → INT8 → INT4) = Smaller size + Faster inference + More accuracy loss

The skill is choosing the right precision for your accuracy tolerance.
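
To make the trade-off concrete, here is a minimal sketch of the affine INT8 mapping on a toy tensor (NumPy only; illustrative, not any framework's exact implementation):

import numpy as np

# Toy affine (asymmetric) INT8 quantization of one tensor.
# Real frameworks apply this per-tensor or per-channel using calibrated ranges.
x = (np.random.randn(1000) * 3.0).astype(np.float32)   # FP32 values

x_min, x_max = float(x.min()), float(x.max())
scale = (x_max - x_min) / 255.0               # spread the observed range over 256 levels
zero_point = int(round(-x_min / scale))       # integer code that represents FP32 zero

q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)   # quantize
x_hat = (q.astype(np.float32) - zero_point) * scale                      # dequantize

print(f"Storage: {x.nbytes} bytes (FP32) -> {q.nbytes} bytes (INT8)")     # 4× smaller
print(f"Max round-trip error: {np.abs(x - x_hat).max():.4f}")             # the accuracy cost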

Quantization Framework

┌────────────────────────────────────────────┐
│   1. Recognize Quantization Need           │
│   CPU/Edge + (Slow OR Large)               │
└──────────────┬─────────────────────────────┘
               │
               ▼
┌────────────────────────────────────────────┐
│   2. Choose Quantization Type              │
│   Dynamic → Static → QAT (increasing cost) │
└──────────────┬─────────────────────────────┘
               │
               ▼
┌────────────────────────────────────────────┐
│   3. Calibrate (if Static/QAT)             │
│   100-1000 representative samples          │
└──────────────┬─────────────────────────────┘
               │
               ▼
┌────────────────────────────────────────────┐
│   4. Validate Accuracy Trade-offs          │
│   Baseline vs Quantized accuracy           │
└──────────────┬─────────────────────────────┘
               │
               ▼
┌────────────────────────────────────────────┐
│   5. Decide: Accept or Iterate             │
│   <2% loss → Deploy                        │
│   >2% loss → Try QAT or different precision│
└────────────────────────────────────────────┘

Part 1: Quantization Types

Type 1: Dynamic Quantization

What it does: Quantizes weights to INT8, keeps activations in FP32.

When to use:

  • Simplest quantization (no calibration needed)
  • Primary goal is size reduction
  • Batch processing where latency less critical
  • Quick experiment to see if quantization helps

Benefits:

  • ✅ ~4× size reduction (weights, which dominate model size, drop from 32-bit to 8-bit)
  • ✅ 1.2-1.5× CPU speedup (modest, because activations still FP32)
  • ✅ Minimal accuracy loss (~0.2-0.5%)
  • ✅ No calibration data needed

Limitations:

  • ⚠️ Limited CPU speedup (activations still FP32)
  • ⚠️ Not optimal for edge devices needing maximum performance

PyTorch implementation:

import os
import torch
import torch.quantization

# WHY: Dynamic quantization is simplest - just one function call
# No calibration data needed because activations stay FP32
model = torch.load('model.pth')
model.eval()  # WHY: Must be in eval mode (no batchnorm updates)

# WHY: Specify which layers to quantize (Linear, LSTM, etc.)
# These layers benefit most from quantization
quantized_model = torch.quantization.quantize_dynamic(
    model,
    qconfig_spec={torch.nn.Linear},  # WHY: Quantize Linear layers only
    dtype=torch.qint8  # WHY: INT8 is standard precision
)

# Save quantized model
torch.save(quantized_model.state_dict(), 'model_quantized_dynamic.pth')

# Verify size reduction
original_size = os.path.getsize('model.pth') / (1024 ** 2)  # MB
quantized_size = os.path.getsize('model_quantized_dynamic.pth') / (1024 ** 2)
print(f"Original: {original_size:.1f}MB → Quantized: {quantized_size:.1f}MB")
print(f"Size reduction: {original_size / quantized_size:.1f}×")

Example use case: BERT classification model where primary goal is reducing size from 440MB to 110MB for easier deployment.


Type 2: Static Quantization (Post-Training Quantization)

What it does: Quantizes both weights and activations to INT8.

When to use:

  • Need maximum CPU speedup (2-4×)
  • Deploying to CPU servers or edge devices
  • Can afford calibration step (5-10 minutes)
  • Primary goal is inference speed

Benefits:

  • ✅ 4× size reduction (same as dynamic)
  • ✅ 2-4× CPU speedup (both weights and activations INT8)
  • ✅ No retraining required (post-training)
  • ✅ Acceptable accuracy loss (~0.5-1%)

Requirements:

  • ⚠️ Needs calibration data (100-1000 samples from validation set)
  • ⚠️ Slightly more complex setup than dynamic

PyTorch implementation:

import torch
import torch.quantization

def calibrate_model(model, calibration_loader):
    """
    Calibrate model by running representative data through it.

    WHY: Static quantization needs to know activation ranges.
    Calibration finds min/max values for each activation layer.

    Args:
        model: Model in eval mode with quantization stubs
        calibration_loader: DataLoader with 100-1000 samples
    """
    model.eval()
    with torch.no_grad():
        for batch_idx, (data, _) in enumerate(calibration_loader):
            model(data)
            if batch_idx >= 100:  # WHY: 100 batches usually sufficient
                break
    return model

# Step 0: Keep an FP32 copy for the accuracy/latency comparison below
original_model = torch.load('model.pth')
original_model.eval()

# Step 1: Prepare model for quantization
model = torch.load('model.pth')
model.eval()

# WHY: Attach observers that record activation ranges during calibration
# (eager-mode static quantization also expects QuantStub/DeQuantStub at the model's boundaries)
model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
torch.quantization.prepare(model, inplace=True)

# Step 2: Calibrate with representative data
# WHY: Must use data from training/validation set, not random data
# Calibration finds activation ranges - needs real distribution
calibration_dataset = torch.utils.data.Subset(
    val_dataset,
    indices=range(1000)  # WHY: 1000 samples sufficient for most models
)
calibration_loader = torch.utils.data.DataLoader(
    calibration_dataset,
    batch_size=32,
    shuffle=False  # WHY: Order doesn't matter for calibration
)

model = calibrate_model(model, calibration_loader)

# Step 3: Convert to quantized model
torch.quantization.convert(model, inplace=True)

# Save quantized model
torch.save(model.state_dict(), 'model_quantized_static.pth')

# Benchmark speed improvement
import time

def benchmark(model, data, num_iterations=100):
    """WHY: Warm up model first, then measure average latency."""
    model.eval()
    # Warm up (first few iterations slower)
    for _ in range(10):
        model(data)

    start = time.time()
    with torch.no_grad():
        for _ in range(num_iterations):
            model(data)
    end = time.time()
    return (end - start) / num_iterations * 1000  # ms per inference

test_data = torch.randn(1, 3, 224, 224)  # Example input

baseline_latency = benchmark(original_model, test_data)
quantized_latency = benchmark(model, test_data)

print(f"Baseline: {baseline_latency:.2f}ms")
print(f"Quantized: {quantized_latency:.2f}ms")
print(f"Speedup: {baseline_latency / quantized_latency:.2f}×")

Example use case: ResNet50 image classifier for CPU inference - need <5ms latency, achieve 4ms with static quantization (vs 15ms baseline).


Type 3: Quantization-Aware Training (QAT)

What it does: Simulates quantization during training to minimize accuracy loss.

When to use:

  • Static quantization accuracy loss too large (>2%)
  • Need best possible accuracy with INT8
  • Can afford retraining (hours to days)
  • Critical production system with strict accuracy requirements

Benefits:

  • ✅ Best accuracy (~0.1-0.3% loss vs 0.5-1% for static)
  • ✅ 4× size reduction (same as dynamic/static)
  • ✅ 2-4× CPU speedup (same as static)

Limitations:

  • ⚠️ Requires retraining (most expensive option)
  • ⚠️ Takes hours to days depending on model size
  • ⚠️ More complex implementation

PyTorch implementation:

import torch
import torch.quantization

def train_one_epoch_qat(model, train_loader, optimizer, criterion):
    """
    Train one epoch with quantization-aware training.

    WHY: QAT inserts fake quantization ops during training.
    Model learns to be robust to quantization errors.
    """
    model.train()
    for data, target in train_loader:
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
    return model

# Step 1: Prepare model for QAT
model = torch.load('model.pth')
model.train()

# WHY: QAT config includes fake quantization ops
# These simulate quantization during forward pass
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
torch.quantization.prepare_qat(model, inplace=True)

# Step 2: Train with quantization-aware training
# WHY: Model learns to compensate for quantization errors
optimizer = torch.optim.SGD(model.parameters(), lr=0.0001)  # WHY: Low LR for fine-tuning
criterion = torch.nn.CrossEntropyLoss()

num_epochs = 5  # WHY: Usually 5-10 epochs sufficient for QAT fine-tuning
for epoch in range(num_epochs):
    model = train_one_epoch_qat(model, train_loader, optimizer, criterion)
    print(f"Epoch {epoch+1}/{num_epochs} complete")

# Step 3: Convert to quantized model
model.eval()
torch.quantization.convert(model, inplace=True)

# Save QAT quantized model
torch.save(model.state_dict(), 'model_quantized_qat.pth')

Example use case: Medical imaging model where accuracy is critical - static quantization gives 2% accuracy loss, QAT reduces to 0.3%.


Part 2: Quantization Type Decision Matrix

| Type    | Complexity | Calibration | Retraining | Size Reduction | CPU Speedup | Accuracy Loss |
|---------|------------|-------------|------------|----------------|-------------|---------------|
| Dynamic | Low        | No          | No         | 4×             | 1.2-1.5×    | ~0.2-0.5%     |
| Static  | Medium     | Yes         | No         | 4×             | 2-4×        | ~0.5-1%       |
| QAT     | High       | Yes         | Yes        | 4×             | 2-4×        | ~0.1-0.3%     |

Decision flow:

  1. Start with dynamic quantization: Simplest, verify quantization helps
  2. Upgrade to static quantization: If need more speedup, can afford calibration
  3. Use QAT: Only if accuracy loss from static too large (rare)

Why this order? Incremental cost: dynamic quantization takes minutes, static adds a short calibration step (roughly 15 minutes), and QAT requires hours to days of retraining. Don't pay for QAT unless you need it.
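
A hypothetical helper that encodes this escalation order (the function name and the 2% threshold are illustrative; the threshold comes from the validation guidance later in this skill):

def choose_quantization_type(needs_cpu_speedup, has_calibration_data, static_accuracy_loss=None):
    """Illustrative decision helper: escalate dynamic → static → QAT only as needed."""
    if not needs_cpu_speedup:
        return "dynamic"        # size reduction only: cheapest option
    if not has_calibration_data:
        return "dynamic"        # no representative data available yet
    if static_accuracy_loss is None or static_accuracy_loss <= 2.0:
        return "static"         # 2-4× CPU speedup with acceptable loss
    return "qat"                # static loss >2%: pay the retraining cost

# Example: need CPU speedup, have calibration data, static quantization lost 2.8% accuracy
print(choose_quantization_type(True, True, static_accuracy_loss=2.8))  # → "qat"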


Part 3: Calibration Best Practices

What is Calibration?

Purpose: Find min/max ranges for each activation layer.

Why needed: Static quantization needs to know activation ranges to map FP32 → INT8. Without calibration, ranges are wrong → accuracy collapses.

How it works (see the sketch after this list):

  1. Run representative data through model
  2. Record min/max activation values per layer
  3. Use these ranges to quantize activations at inference time
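
To make steps 1-2 concrete, here is a minimal sketch using forward hooks (illustrative only; in practice the observers inserted by torch.quantization.prepare() collect these ranges for you):

import torch

def record_activation_ranges(model, calibration_loader, max_batches=100):
    """Record per-layer min/max activations with forward hooks (what observers do internally)."""
    ranges, hooks = {}, []

    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor):
                lo, hi = output.min().item(), output.max().item()
                old_lo, old_hi = ranges.get(name, (float("inf"), float("-inf")))
                ranges[name] = (min(old_lo, lo), max(old_hi, hi))
        return hook

    for name, module in model.named_modules():
        if isinstance(module, (torch.nn.Conv2d, torch.nn.Linear, torch.nn.ReLU)):
            hooks.append(module.register_forward_hook(make_hook(name)))

    model.eval()
    with torch.no_grad():
        for batch_idx, (data, _) in enumerate(calibration_loader):
            model(data)
            if batch_idx >= max_batches:
                break

    for h in hooks:
        h.remove()
    return ranges  # e.g. {'layer1.0.conv1': (-3.2, 4.1), ...}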

Calibration Data Requirements

Data source:

  • Use validation set samples (matches training distribution)
  • ❌ Don't use random images from internet (different distribution)
  • ❌ Don't use single image repeated (insufficient coverage)
  • ❌ Don't use training set that doesn't match deployment (distribution shift)

Data size:

  • Minimum: 100 samples (sufficient for simple models)
  • Recommended: 500-1000 samples (better coverage)
  • Maximum: Full validation set is overkill (slow, no benefit)

Data characteristics:

  • Must cover range of inputs model sees in production
  • Include edge cases (bright/dark images, long/short text)
  • Distribution should match deployment, not just training
  • Class balance less important than input diversity

Example calibration data selection:

import torch
import numpy as np

def select_calibration_data(val_dataset, num_samples=1000):
    """
    Select diverse calibration samples from validation set.

    WHY: Want samples that cover range of activation values.
    Random selection from validation set usually sufficient.

    Args:
        val_dataset: Full validation dataset
        num_samples: Number of calibration samples (default 1000)

    Returns:
        Calibration dataset subset
    """
    # WHY: Random selection ensures diversity
    # Stratified sampling can help ensure class coverage
    indices = np.random.choice(len(val_dataset), num_samples, replace=False)
    calibration_dataset = torch.utils.data.Subset(val_dataset, indices)

    return calibration_dataset

# Example: Select 1000 random samples from validation set
calibration_dataset = select_calibration_data(val_dataset, num_samples=1000)
calibration_loader = torch.utils.data.DataLoader(
    calibration_dataset,
    batch_size=32,
    shuffle=False  # WHY: Order doesn't matter for calibration
)

Common Calibration Pitfalls

Pitfall 1: Using wrong data distribution

  • ❌ "Random images from internet" for ImageNet-trained model
  • ✅ Use ImageNet validation set samples

Pitfall 2: Too few samples

  • ❌ 10 samples (insufficient coverage of activation ranges)
  • ✅ 100-1000 samples (good coverage)

Pitfall 3: Using training data that doesn't match deployment

  • ❌ Calibrate on sunny outdoor images, deploy on indoor images
  • ✅ Calibrate on data matching deployment distribution

Pitfall 4: Skipping calibration validation

  • ❌ Calibrate once, assume it works
  • ✅ Validate accuracy after calibration to verify ranges are good

Part 4: Precision Selection (INT8 vs INT4 vs FP16)

Precision Spectrum

| Precision | Bits | Size vs FP32 | Speedup        | Typical Accuracy Loss |
|-----------|------|--------------|----------------|-----------------------|
| FP32      | 32   | 100%         | 1× (baseline)  | 0% (baseline)         |
| FP16      | 16   | 50%          | ~1.5×          | <0.1%                 |
| INT8      | 8    | 25%          | 2-4×           | 0.5-1%                |
| INT4      | 4    | 12.5%        | 4-8×           | 1-3%                  |

Trade-off: Lower precision = Smaller size + Faster inference + More accuracy loss

When to Use Each Precision

FP16 (Half Precision) - see the usage sketch after this section:

  • GPU inference (Tensor Cores optimized for FP16)
  • Need minimal accuracy loss (<0.1%)
  • Size reduction secondary concern
  • Example: Large language models on GPU

INT8 (Standard Quantization):

  • CPU inference (INT8 operations fast on CPU)
  • Edge device deployment
  • Good balance of size/speed/accuracy
  • Most common choice for production deployment
  • Example: Image classification on mobile devices

INT4 (Aggressive Quantization):

  • Extremely memory-constrained (e.g., 1GB mobile devices)
  • Can tolerate larger accuracy loss (1-3%)
  • Need maximum size reduction (8×)
  • Use sparingly - accuracy risk high
  • Example: Large language models (LLaMA-7B: 13GB → 3.5GB)
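
For the FP16 case there is no calibration step at all; a minimal sketch, assuming a CUDA device and the same 'model.pth' checkpoint used elsewhere in this skill:

import torch

# FP16 on GPU: cast weights to half precision, no calibration needed
# (on CPU, prefer INT8 quantization instead)
model = torch.load('model.pth')
model.eval()
model = model.half().to('cuda')

test_data = torch.randn(1, 3, 224, 224, dtype=torch.float16, device='cuda')
with torch.no_grad():
    output = model(test_data)

# Alternative: keep FP32 weights and let autocast run matmuls/convs in FP16
model_fp32 = torch.load('model.pth').eval().to('cuda')
with torch.no_grad(), torch.autocast(device_type='cuda', dtype=torch.float16):
    output = model_fp32(torch.randn(1, 3, 224, 224, device='cuda'))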

Decision Flow

def choose_precision(accuracy_tolerance, deployment_target):
    """
    Choose quantization precision based on requirements.

    WHY: Different precisions for different constraints.
    INT8 is default, FP16 for GPU, INT4 for extreme memory constraints.
    """
    if accuracy_tolerance < 0.1:
        return "FP16"  # Minimal accuracy loss required
    elif deployment_target == "GPU":
        return "FP16"  # GPU optimized for FP16
    elif deployment_target in ["CPU", "edge"]:
        return "INT8"  # CPU optimized for INT8
    elif deployment_target == "extreme_edge" and accuracy_tolerance > 1:
        return "INT4"  # Only if can tolerate 1-3% loss
    else:
        return "INT8"  # Default safe choice

Part 5: ONNX Quantization (Cross-Framework)

When to use: Deploying to ONNX Runtime (CPU/edge devices) or need cross-framework compatibility.

ONNX Static Quantization

import torch
import numpy as np
import onnxruntime
from onnxruntime.quantization import quantize_static, CalibrationDataReader, QuantFormat

class CalibrationDataReaderWrapper(CalibrationDataReader):
    """
    WHY: ONNX requires custom calibration data reader.
    This class feeds calibration data to ONNX quantization engine.
    """
    def __init__(self, calibration_data):
        self.calibration_data = calibration_data
        self.iterator = iter(calibration_data)

    def get_next(self):
        """WHY: Called by ONNX to get next calibration batch."""
        try:
            data, _ = next(self.iterator)
            return {"input": data.numpy()}  # WHY: Return dict of input name → data
        except StopIteration:
            return None

# Step 1: Export PyTorch model to ONNX
model = torch.load('model.pth')
model.eval()
dummy_input = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    dummy_input,
    'model.onnx',
    input_names=['input'],
    output_names=['output'],
    opset_version=13  # WHY: ONNX opset 13+ supports quantization ops
)

# Step 2: Prepare calibration data
calibration_loader = torch.utils.data.DataLoader(
    calibration_dataset,
    batch_size=1,  # WHY: ONNX calibration uses batch size 1
    shuffle=False
)
calibration_reader = CalibrationDataReaderWrapper(calibration_loader)

# Step 3: Quantize ONNX model
quantize_static(
    'model.onnx',
    'model_quantized.onnx',
    calibration_data_reader=calibration_reader,
    quant_format=QuantFormat.QDQ  # WHY: QDQ format compatible with most backends
)

# Step 4: Benchmark ONNX quantized model
import time

session = onnxruntime.InferenceSession('model_quantized.onnx')
input_name = session.get_inputs()[0].name

test_data = np.random.randn(1, 3, 224, 224).astype(np.float32)

# Warm up
for _ in range(10):
    session.run(None, {input_name: test_data})

# Benchmark
start = time.time()
for _ in range(100):
    session.run(None, {input_name: test_data})
end = time.time()

latency = (end - start) / 100 * 1000  # ms per inference
print(f"ONNX Quantized latency: {latency:.2f}ms")

ONNX advantages:

  • Cross-framework (works with PyTorch, TensorFlow, etc.)
  • Optimized ONNX Runtime for CPU inference
  • Good hardware backend support (x86, ARM)

Part 6: Accuracy Validation (Critical Step)

Why Accuracy Validation Matters

Quantization is lossy compression. Must measure accuracy impact:

  • Some models tolerate quantization well (<0.5% loss)
  • Some models sensitive to quantization (>2% loss)
  • Some layers more sensitive than others
  • Can't assume quantization is safe without measuring

Validation Methodology

def validate_quantization(original_model, quantized_model, val_loader):
    """
    Validate quantization by comparing accuracy.

    WHY: Quantization is lossy - must measure impact.
    Compare baseline vs quantized on same validation set.

    Returns:
        dict with baseline_acc, quantized_acc, accuracy_loss
    """
    def evaluate(model, data_loader):
        model.eval()
        correct = 0
        total = 0
        with torch.no_grad():
            for data, target in data_loader:
                output = model(data)
                pred = output.argmax(dim=1)
                correct += (pred == target).sum().item()
                total += target.size(0)
        return 100.0 * correct / total

    baseline_acc = evaluate(original_model, val_loader)
    quantized_acc = evaluate(quantized_model, val_loader)
    accuracy_loss = baseline_acc - quantized_acc

    return {
        'baseline_acc': baseline_acc,
        'quantized_acc': quantized_acc,
        'accuracy_loss': accuracy_loss,
        'acceptable': accuracy_loss < 2.0  # WHY: <2% loss usually acceptable
    }

# Example validation
results = validate_quantization(original_model, quantized_model, val_loader)
print(f"Baseline accuracy: {results['baseline_acc']:.2f}%")
print(f"Quantized accuracy: {results['quantized_acc']:.2f}%")
print(f"Accuracy loss: {results['accuracy_loss']:.2f}%")
print(f"Acceptable: {results['acceptable']}")

# Decision logic
if results['acceptable']:
    print("✅ Quantization acceptable - deploy quantized model")
else:
    print("❌ Accuracy loss too large - try QAT or reconsider quantization")

Acceptable Accuracy Thresholds

General guidelines:

  • <1% loss: Excellent quantization result
  • 1-2% loss: Acceptable for most applications
  • 2-3% loss: Consider QAT to reduce loss
  • >3% loss: Quantization may not be suitable for this model

Task-specific thresholds (see the sketch after this list):

  • Image classification: 1-2% top-1 accuracy loss acceptable
  • Object detection: 1-2% mAP loss acceptable
  • NLP classification: 0.5-1% accuracy loss acceptable
  • Medical/safety-critical: <0.5% loss required (use QAT)
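
These budgets can be encoded directly into the validation logic; a hypothetical extension of validate_quantization above (the task names and thresholds simply mirror this list; nothing here is a library API):

# Illustrative task-specific loss budgets (percentage points of accuracy/mAP)
ACCEPTABLE_LOSS = {
    'image_classification': 2.0,
    'object_detection': 2.0,
    'nlp_classification': 1.0,
    'safety_critical': 0.5,
}

def is_quantization_acceptable(accuracy_loss, task='image_classification'):
    """Return True if the measured accuracy loss is within the task-specific budget."""
    return accuracy_loss <= ACCEPTABLE_LOSS.get(task, 2.0)

# Example: 0.8% loss is fine for an image classifier, not for a safety-critical model
print(is_quantization_acceptable(0.8, 'image_classification'))  # True
print(is_quantization_acceptable(0.8, 'safety_critical'))       # False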

Part 7: LLM Quantization (GPTQ, AWQ)

Note: This skill covers general quantization. For LLM-specific optimization (GPTQ, AWQ, KV cache, etc.), see the llm-inference-optimization skill in the llm-specialist pack.

LLM Quantization Overview

Why LLMs need quantization:

  • Very large (7B parameters = 13GB in FP16)
  • Memory-bound inference (limited by VRAM)
  • INT4 quantization: 13GB → 3.5GB (fits in consumer GPUs)

LLM-specific quantization methods:

  • GPTQ: Post-training quantization optimized for LLMs
  • AWQ: Activation-aware weight quantization (better quality than GPTQ)
  • Both: Achieve INT4 with <0.5 perplexity increase

When to Use LLM Quantization

Use when:

  • Deploying LLMs locally (consumer GPUs)
  • Memory-constrained (need to fit in 12GB/24GB VRAM)
  • Cost-sensitive (smaller models cheaper to host)
  • Latency-sensitive (smaller models faster to load)

Don't use when:

  • Have sufficient GPU memory for FP16
  • Accuracy critical (medical, legal applications)
  • Already using API (OpenAI, Anthropic) - they handle optimization

LLM Quantization References

For detailed LLM quantization:

  • See skill: llm-inference-optimization (llm-specialist pack)
  • Covers: GPTQ, AWQ, KV cache optimization, token streaming
  • Tools: llama.cpp, vLLM, text-generation-inference

Quick reference (defer to llm-specialist for details):

# GPTQ quantization (example - see llm-specialist for full details)
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

# WHY: GPTQ optimizes layer-wise for minimal perplexity increase
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
quantization_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=quantization_config,
    device_map="auto"
)

# Result: 13GB → 3.5GB, <0.5 perplexity increase

Part 8: When NOT to Quantize

Scenario 1: Already Fast Enough

Example: MobileNetV2 (14MB, 3ms CPU latency)

  • Quantization: 14MB → 4MB, 3ms → 2ms
  • Benefit: 10MB saved, 1ms faster
  • Cost: Calibration, validation, testing, debugging
  • Decision: Not worth effort unless specific requirement

Rule: If current performance meets requirements, don't optimize.

Scenario 2: GPU-Only Deployment with No Memory Constraints

Example: ResNet50 on Tesla V100 with 32GB VRAM

  • Quantization: 1.5-2× GPU speedup (modest)
  • FP32 already fast on GPU (Tensor Cores optimized)
  • No memory pressure (plenty of VRAM)
  • Decision: Focus on other bottlenecks (data loading, I/O)

Rule: Quantization is most beneficial for CPU inference and memory-constrained GPU.

Scenario 3: Accuracy-Critical Applications

Example: Medical diagnosis model where misdiagnosis has severe consequences

  • Quantization introduces accuracy loss (even if small)
  • Risk not worth benefit
  • Decision: Keep FP32, optimize other parts (batching, caching)

Rule: Safety-critical systems should avoid lossy compression unless thoroughly validated.

Scenario 4: Prototyping Phase

Example: Early development, trying different architectures

  • Quantization is optimization - premature at prototype stage
  • Focus on getting model working first
  • Decision: Defer quantization until production deployment

Rule: Don't optimize until you need to (Knuth: "Premature optimization is the root of all evil").


Part 9: Quantization Benchmarks (Expected Results)

Image Classification (ResNet50, ImageNet)

| Metric         | FP32 Baseline | Dynamic INT8      | Static INT8       | QAT INT8          |
|----------------|---------------|-------------------|-------------------|-------------------|
| Size           | 98MB          | 25MB (4×)         | 25MB (4×)         | 25MB (4×)         |
| CPU Latency    | 15ms          | 12ms (1.25×)      | 4ms (3.75×)       | 4ms (3.75×)       |
| Top-1 Accuracy | 76.1%         | 75.9% (0.2% loss) | 75.3% (0.8% loss) | 75.9% (0.2% loss) |

Insight: Static quantization gives 3.75× speedup with acceptable 0.8% accuracy loss.

Object Detection (YOLOv5s, COCO)

| Metric      | FP32 Baseline | Static INT8       | QAT INT8          |
|-------------|---------------|-------------------|-------------------|
| Size        | 14MB          | 4MB (3.5×)        | 4MB (3.5×)        |
| CPU Latency | 45ms          | 15ms (3×)         | 15ms (3×)         |
| mAP@0.5     | 37.4%         | 36.8% (0.6% loss) | 37.2% (0.2% loss) |

Insight: QAT gives better accuracy (0.2% vs 0.6% loss) with same speedup.

NLP Classification (BERT-base, GLUE)

| Metric      | FP32 Baseline | Dynamic INT8      | Static INT8       |
|-------------|---------------|-------------------|-------------------|
| Size        | 440MB         | 110MB (4×)        | 110MB (4×)        |
| CPU Latency | 35ms          | 28ms (1.25×)      | 12ms (2.9×)       |
| Accuracy    | 93.5%         | 93.2% (0.3% loss) | 92.8% (0.7% loss) |

Insight: Static quantization gives a 2.9× speedup; dynamic is sufficient if speedup is not critical.

LLM Inference (LLaMA-7B)

| Metric              | FP16 Baseline | GPTQ INT4            | AWQ INT4             |
|---------------------|---------------|----------------------|----------------------|
| Size                | 13GB          | 3.5GB (3.7×)         | 3.5GB (3.7×)         |
| First Token Latency | 800ms         | 250ms (3.2×)         | 230ms (3.5×)         |
| Perplexity          | 5.68          | 5.82 (0.14 increase) | 5.77 (0.09 increase) |

Insight: AWQ gives better quality than GPTQ with similar speedup.


Part 10: Common Pitfalls and Solutions

Pitfall 1: Skipping Accuracy Validation

Issue: Deploy quantized model without measuring accuracy impact.
Risk: Discover accuracy degradation in production (too late).
Solution: Always validate accuracy on representative data before deployment.

# ❌ WRONG: Deploy without validation
quantized_model = quantize(model)
deploy(quantized_model)  # Hope it works!

# ✅ RIGHT: Validate before deployment
quantized_model = quantize(model)
results = validate_accuracy(original_model, quantized_model, val_loader)
if results['acceptable']:
    deploy(quantized_model)
else:
    print("Accuracy loss too large - try QAT")

Pitfall 2: Using Wrong Calibration Data

Issue: Calibrating with random/unrepresentative data.
Risk: Activation ranges are wrong → accuracy collapses.
Solution: Use 100-1000 samples from the validation set matching the deployment distribution.

# ❌ WRONG: Random images from internet
calibration_data = download_random_images()

# ✅ RIGHT: Samples from validation set
calibration_data = torch.utils.data.Subset(val_dataset, range(1000))

Pitfall 3: Choosing Wrong Quantization Type

Issue: Using dynamic quantization when you need static-level speedup.
Risk: Getting a 1.2× speedup instead of a 3× speedup.
Solution: Match the quantization type to requirements (dynamic for size, static for speed).

# ❌ WRONG: Use dynamic when need speed
if need_fast_cpu_inference:
    quantized_model = torch.quantization.quantize_dynamic(model)  # Only 1.2× speedup

# ✅ RIGHT: Use static for speed
if need_fast_cpu_inference:
    model = prepare_and_calibrate(model, calibration_data)
    quantized_model = torch.quantization.convert(model)  # 2-4× speedup

Pitfall 4: Quantizing GPU-Only Deployments

Issue: Quantizing a model for GPU inference when there is no memory pressure.
Risk: The effort is not worth the modest 1.5-2× GPU speedup.
Solution: Only quantize for GPU if memory-constrained (e.g., multiple models sharing VRAM).

# ❌ WRONG: Quantize for GPU with no memory issue
if deployment_target == "GPU" and have_plenty_of_memory:
    quantized_model = quantize(model)  # Wasted effort

# ✅ RIGHT: Skip quantization if not needed
if deployment_target == "GPU" and have_plenty_of_memory:
    deploy(model)  # Keep FP32, focus on other optimizations

Pitfall 5: Over-Quantizing (INT4 When INT8 Sufficient)

Issue: Using aggressive INT4 quantization when INT8 would suffice.
Risk: Larger accuracy loss than necessary.
Solution: Start with INT8 (the standard choice); only use INT4 under extreme memory constraints.

# ❌ WRONG: Jump to INT4 without trying INT8
quantized_model = quantize(model, precision="INT4")  # 2-3% accuracy loss

# ✅ RIGHT: Start with INT8, only use INT4 if needed
quantized_model_int8 = quantize(model, precision="INT8")  # 0.5-1% accuracy loss
if model_still_too_large:
    quantized_model_int4 = quantize(model, precision="INT4")

Pitfall 6: Assuming All Layers Quantize Equally

Issue: Quantizing all layers uniformly even though some layers are more sensitive than others.
Risk: Accuracy loss dominated by a few sensitive layers.
Solution: Use mixed precision - keep sensitive layers in FP32 or INT8 and quantize the rest more aggressively.

# ✅ ADVANCED: Mixed precision via FX graph mode quantization
# Keep the most sensitive layers (often first/last) in FP32, quantize the rest to INT8
import torch
from torch.ao.quantization import QConfigMapping, get_default_qconfig
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

qconfig_mapping = QConfigMapping()
qconfig_mapping.set_global(get_default_qconfig('fbgemm'))  # INT8 default
qconfig_mapping.set_module_name('layer1', None)  # Keep first block FP32 (name is model-specific)
qconfig_mapping.set_module_name('fc', None)      # Keep final layer FP32 (name is model-specific)

example_inputs = (torch.randn(1, 3, 224, 224),)
prepared = prepare_fx(model, qconfig_mapping, example_inputs)
prepared = calibrate_model(prepared, calibration_loader)  # reuse the calibration helper from Part 1
quantized_model = convert_fx(prepared)

Part 11: Decision Framework Summary

Step 1: Recognize Quantization Need

Symptoms:

  • Model too slow on CPU (>10ms when need <5ms)
  • Model too large for edge devices (>50MB)
  • Deploying to CPU/edge (not GPU)
  • Need to reduce hosting costs

If YES → Proceed to Step 2
If NO → Don't quantize, focus on other optimizations

Step 2: Choose Quantization Type

Start with Dynamic:
├─ Sufficient? (meets latency/size requirements)
│  ├─ YES → Deploy dynamic quantized model
│  └─ NO → Proceed to Static
│
Static Quantization:
├─ Sufficient? (meets latency/size + accuracy acceptable)
│  ├─ YES → Deploy static quantized model
│  └─ NO → Accuracy loss >2%
│     │
│     └─ Proceed to QAT
│
QAT:
├─ Train with quantization awareness
└─ Achieves <1% accuracy loss → Deploy

Step 3: Calibrate (if Static/QAT)

Calibration data:

  • Source: Validation set (representative samples)
  • Size: 100-1000 samples
  • Characteristics: Match deployment distribution

Calibration process:

  1. Select samples from validation set
  2. Run through model to collect activation ranges
  3. Validate accuracy after calibration
  4. If accuracy loss >2%, try different calibration data or QAT

Step 4: Validate Accuracy

Required measurements:

  • Baseline accuracy (FP32)
  • Quantized accuracy (INT8/INT4)
  • Accuracy loss (baseline - quantized)
  • Acceptable threshold (typically <2%)

Decision:

  • If accuracy loss <2% → Deploy
  • If accuracy loss >2% → Try QAT or reconsider quantization

Step 5: Benchmark Performance

Required measurements (see the sketch at the end of this step):

  • Model size (MB): baseline vs quantized
  • Inference latency (ms): baseline vs quantized
  • Throughput (requests/sec): baseline vs quantized

Verify expected results:

  • Size: 4× reduction (FP32 → INT8)
  • CPU speedup: 2-4× (static quantization)
  • GPU speedup: 1.5-2× (if applicable)
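
A minimal measurement sketch covering all three (assumes the original_model, quantized_model, and file names from Part 1; throughput here is single-threaded, batch-size-1 throughput):

import os
import time
import torch

def report(model_path, model, test_data, num_iterations=100):
    """Report model size (MB), latency (ms/inference), and throughput (inferences/sec)."""
    size_mb = os.path.getsize(model_path) / (1024 ** 2)

    model.eval()
    with torch.no_grad():
        for _ in range(10):                      # warm up
            model(test_data)
        start = time.time()
        for _ in range(num_iterations):
            model(test_data)
        elapsed = time.time() - start

    latency_ms = elapsed / num_iterations * 1000
    throughput = num_iterations / elapsed
    print(f"{model_path}: {size_mb:.1f}MB, {latency_ms:.2f}ms/inference, "
          f"{throughput:.1f} inferences/sec")
    return size_mb, latency_ms, throughput

test_data = torch.randn(1, 3, 224, 224)
report('model.pth', original_model, test_data)                     # FP32 baseline
report('model_quantized_static.pth', quantized_model, test_data)   # INT8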

Part 12: Production Deployment Checklist

Before deploying quantized model to production:

✅ Accuracy Validated

  • Baseline accuracy measured on validation set
  • Quantized accuracy measured on same validation set
  • Accuracy loss within acceptable threshold (<2%)
  • Validated on representative production data

✅ Performance Benchmarked

  • Size reduction measured (expect 4× for INT8)
  • Latency improvement measured (expect 2-4× CPU)
  • Throughput improvement measured
  • Performance meets requirements

✅ Calibration Verified (if static/QAT)

  • Used representative samples from validation set (not random data)
  • Used sufficient calibration data (100-1000 samples)
  • Calibration data matches deployment distribution

✅ Edge Cases Tested

  • Tested on diverse inputs (bright/dark images, long/short text)
  • Validated numerical stability (no NaN/Inf outputs; see the sketch after this list)
  • Tested inference on target hardware (CPU/GPU/edge device)
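
A quick stability check, sketched here assuming the quantized_model and val_loader from Part 6:

import torch

def check_numerical_stability(quantized_model, data_loader, max_batches=50):
    """Flag NaN/Inf outputs from the quantized model on representative inputs."""
    quantized_model.eval()
    with torch.no_grad():
        for batch_idx, (data, _) in enumerate(data_loader):
            output = quantized_model(data)
            if not torch.isfinite(output).all():
                print(f"Non-finite output detected in batch {batch_idx}")
                return False
            if batch_idx >= max_batches:
                break
    return True

assert check_numerical_stability(quantized_model, val_loader), "Quantized model is numerically unstable"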

✅ Rollback Plan

  • Can easily revert to FP32 model if issues found
  • Monitoring in place to detect accuracy degradation
  • A/B testing plan to compare FP32 vs quantized

Skill Mastery Checklist

You have mastered quantization for inference when you can:

  • Recognize when quantization is appropriate (CPU/edge deployment, size/speed issues)
  • Choose correct quantization type (dynamic vs static vs QAT) based on requirements
  • Implement dynamic quantization in PyTorch (5 lines of code)
  • Implement static quantization with proper calibration (20 lines of code)
  • Select appropriate calibration data (validation set, 100-1000 samples)
  • Validate accuracy trade-offs systematically (baseline vs quantized)
  • Benchmark performance improvements (size, latency, throughput)
  • Decide when NOT to quantize (GPU-only, already fast, accuracy-critical)
  • Debug quantization issues (accuracy collapse, wrong speedup, numerical instability)
  • Deploy quantized models to production with confidence

Key insight: Quantization is not magic - it's a systematic trade-off of precision for performance. The skill is matching the right quantization approach to your specific requirements.