| name | quantization |
| description | Model quantization for efficient inference and training. Covers precision types (FP32, FP16, BF16, INT8, INT4), BitsAndBytes configuration, memory estimation, and performance tradeoffs. |
Model Quantization
Overview
Quantization reduces the numeric precision of model weights to save memory and, in many cases, speed up inference. A 7B-parameter model needs ~28 GB of weight memory at FP32, but only about 3.5-4 GB at 4-bit.
Quick Reference
| Precision | Bits | Bytes/param | Quality | Speed |
|---|---|---|---|---|
| FP32 | 32 | 4 | Best | Slowest |
| FP16 | 16 | 2 | Excellent | Fast |
| BF16 | 16 | 2 | Excellent | Fast |
| INT8 | 8 | 1 | Good | Faster |
| INT4 | 4 | 0.5 | Acceptable | Fastest |
Memory Estimation
def estimate_memory(params_billions, precision_bits):
    """Estimate model weight memory in GB (weights only; excludes activations, KV cache, and optimizer states)."""
    bytes_per_param = precision_bits / 8
    return params_billions * bytes_per_param
# Example: 7B model
model_size = 7  # billion parameters
print(f"FP32: {estimate_memory(model_size, 32):.1f} GB")  # 28.0 GB
print(f"FP16: {estimate_memory(model_size, 16):.1f} GB")  # 14.0 GB
print(f"INT8: {estimate_memory(model_size, 8):.1f} GB")   # 7.0 GB
print(f"INT4: {estimate_memory(model_size, 4):.1f} GB")   # 3.5 GB
Measure Model Size
def get_model_size(model):
"""Get model size in GB including buffers."""
param_size = sum(p.numel() * p.element_size() for p in model.parameters())
buffer_size = sum(b.numel() * b.element_size() for b in model.buffers())
total = (param_size + buffer_size) / 1024**3
return total
print(f"Model size: {get_model_size(model):.2f} GB")
Load Model at Different Precisions
FP32 (Default)
from transformers import AutoModelForCausalLM
model_32bit = AutoModelForCausalLM.from_pretrained(
"TinyLlama/TinyLlama-1.1B-Chat-v1.0",
device_map="auto"
)
print(f"FP32 size: {get_model_size(model_32bit):.2f} GB")
FP16 / BF16
import torch
model_16bit = AutoModelForCausalLM.from_pretrained(
"TinyLlama/TinyLlama-1.1B-Chat-v1.0",
torch_dtype=torch.float16, # or torch.bfloat16
device_map="auto"
)
print(f"FP16 size: {get_model_size(model_16bit):.2f} GB")
8-bit Quantization
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(
load_in_8bit=True
)
model_8bit = AutoModelForCausalLM.from_pretrained(
"TinyLlama/TinyLlama-1.1B-Chat-v1.0",
quantization_config=quantization_config,
device_map="auto"
)
print(f"8-bit size: {get_model_size(model_8bit):.2f} GB")
4-bit Quantization (Recommended)
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4", # NormalFloat4
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True # Nested quantization
)
model_4bit = AutoModelForCausalLM.from_pretrained(
"TinyLlama/TinyLlama-1.1B-Chat-v1.0",
quantization_config=quantization_config,
device_map="auto"
)
print(f"4-bit size: {get_model_size(model_4bit):.2f} GB")
BitsAndBytesConfig Options
4-bit Configuration
from transformers import BitsAndBytesConfig
import torch
config = BitsAndBytesConfig(
load_in_4bit=True,
# Quantization type
bnb_4bit_quant_type="nf4", # "nf4" or "fp4"
# Compute dtype for dequantized weights
bnb_4bit_compute_dtype=torch.bfloat16,
# Double quantization (saves more memory)
bnb_4bit_use_double_quant=True,
)
Options Explained
| Option | Values | Effect |
|---|---|---|
| load_in_4bit | True / False | Enable 4-bit loading |
| bnb_4bit_quant_type | "nf4", "fp4" | NF4 generally works better for LLM weights |
| bnb_4bit_compute_dtype | torch.float16, torch.bfloat16 | Precision used for computation after dequantization |
| bnb_4bit_use_double_quant | True / False | Quantize the quantization constants for extra savings |
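After loading, the effective settings can be read back from the model itself; transformers records the quantization config on the loaded model's config object.
# Inspect the quantization settings recorded on the loaded 4-bit model
print(model_4bit.config.quantization_config)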
Compare Precision Performance
from transformers import AutoTokenizer, pipeline
import time
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Test message
messages = [{"role": "user", "content": "Explain quantum computing."}]
def benchmark(model, tokenizer, name):
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
start = time.time()
output = pipe(messages, max_new_tokens=100, return_full_text=False)
elapsed = time.time() - start
print(f"{name}:")
print(f" Time: {elapsed:.2f}s")
print(f" Size: {get_model_size(model):.2f} GB")
print(f" Output: {output[0]['generated_text'][:50]}...")
print()
# Benchmark each precision (each model stays resident in memory; load and benchmark one at a time if GPU memory is tight)
benchmark(model_32bit, tokenizer, "FP32")
benchmark(model_16bit, tokenizer, "FP16")
benchmark(model_8bit, tokenizer, "8-bit")
benchmark(model_4bit, tokenizer, "4-bit")
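Wall-clock time is only part of the picture. Peak GPU memory during generation can be measured with torch's CUDA statistics; a minimal sketch reusing the tokenizer loaded above:
import torch
def benchmark_memory(model, tokenizer, name):
    """Report peak GPU memory for one short generation."""
    torch.cuda.reset_peak_memory_stats()
    inputs = tokenizer("Explain quantum computing.", return_tensors="pt").to(model.device)
    model.generate(**inputs, max_new_tokens=100)
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    print(f"{name}: peak GPU memory {peak_gb:.2f} GB")
benchmark_memory(model_4bit, tokenizer, "4-bit")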
Quantization for Training
QLoRA Setup
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch
# 4-bit base model
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True
)
model = AutoModelForCausalLM.from_pretrained(
"TinyLlama/TinyLlama-1.1B-Chat-v1.0",
quantization_config=quantization_config,
device_map="auto"
)
# Prepare for k-bit training
model = prepare_model_for_kbit_training(model)
# Add LoRA adapters
lora_config = LoraConfig(
r=8,
lora_alpha=16,
target_modules=["q_proj", "v_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
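With the adapters attached, it is worth confirming that only a small fraction of parameters is trainable; PEFT models expose a helper for exactly this.
# Only the LoRA adapter weights should be trainable; the 4-bit base stays frozen
model.print_trainable_parameters()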
Precision Comparison
| Precision | Memory (bytes/param) | Quality | Training | Best For |
|---|---|---|---|---|
| FP32 | 4 | Perfect | Yes | Research, baselines |
| FP16 | 2 | Excellent | Yes | Standard training |
| BF16 | 2 | Excellent | Yes | Large models |
| INT8 | 1 | Good | Limited | Inference |
| INT4 | 0.5 | Acceptable | QLoRA | Memory-constrained |
FP16 vs BF16
| Aspect | FP16 | BF16 |
|---|---|---|
| Range | Smaller | Larger (like FP32) |
| Precision | Higher | Lower |
| Overflow risk | Higher | Lower |
| Hardware | All GPUs | Ampere+ |
| Best for | Inference | Training |
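The range difference is easy to demonstrate directly: FP16 overflows just above 65,504, while BF16 keeps roughly the FP32 exponent range at the cost of mantissa precision. A small illustration:
import torch
x = torch.tensor(70000.0)
print(x.to(torch.float16))   # inf - exceeds the FP16 maximum of ~65504
print(x.to(torch.bfloat16))  # 70144. - representable, but rounded (fewer mantissa bits)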
Troubleshooting
Out of Memory
Symptom: CUDA OOM error
Fix:
# Use 4-bit quantization
config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=True
)
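If OOM hits after earlier experiments, stale models may still be holding GPU memory; freeing them before reloading often helps. A short sketch (model_32bit refers to the FP32 model loaded earlier; drop whichever variables you no longer need):
import gc, torch
# Drop references to unused models, then release cached GPU memory
del model_32bit
gc.collect()
torch.cuda.empty_cache()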
Quality Degradation
Symptom: Poor model outputs after quantization
Fix:
- Use nf4 instead of fp4
- Try 8-bit instead of 4-bit
- Increase LoRA rank if fine-tuning
Slow Loading
Symptom: Model takes long to load
Fix:
- Quantization is applied at load time, so some extra delay on first load is expected
- Use device_map="auto" to spread layers across available GPUs
When to Use This Skill
Use when:
- Model doesn't fit in GPU memory
- Need faster inference
- Training with limited resources (QLoRA)
- Deploying to edge devices
Cross-References
- bazzite-ai-jupyter:peft - LoRA with quantization (QLoRA)
- bazzite-ai-jupyter:finetuning - Full fine-tuning
- bazzite-ai-jupyter:transformers - Model architecture