| name | gptq |
| description | Post-training 4-bit quantization for LLMs with minimal accuracy loss. Use for deploying large models (70B, 405B) on consumer GPUs, when you need 4× memory reduction with <2% perplexity degradation, or for faster inference (3-4× speedup) vs FP16. Integrates with transformers and PEFT for QLoRA fine-tuning. |
GPTQ (Generative Pre-trained Transformer Quantization)
Post-training quantization method that compresses LLMs to 4-bit with minimal accuracy loss using group-wise quantization.
When to use GPTQ
Use GPTQ when:
- Need to fit large models (70B+) on limited GPU memory
- Want 4× memory reduction with <2% accuracy loss
- Deploying on consumer GPUs (RTX 4090, 3090)
- Need faster inference (3-4× speedup vs FP16)
Use AWQ instead when:
- Need slightly better accuracy (<1% loss)
- Have newer GPUs (Ampere, Ada)
- Want Marlin kernel support (2× faster on some GPUs)
Use bitsandbytes instead when:
- Need simple integration with transformers
- Want 8-bit quantization (less compression, better quality)
- Don't need pre-quantized model files
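For comparison, the bitsandbytes route needs no pre-quantized checkpoint: quantization happens on the fly at load time. A minimal sketch using the transformers BitsAndBytesConfig API (model id reused from the examples below):
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit on-the-fly quantization: no calibration step, no pre-quantized files
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)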
Quick start
Installation
# Install AutoGPTQ
pip install auto-gptq
# With Triton kernels (Linux only, faster); quote the extra so the shell doesn't expand the brackets
pip install "auto-gptq[triton]"
# With CUDA extensions (faster)
pip install auto-gptq --no-build-isolation
# Full installation
pip install auto-gptq transformers accelerate
Load pre-quantized model
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM
# Load quantized model from HuggingFace
model_name = "TheBloke/Llama-2-7B-Chat-GPTQ"
model = AutoGPTQForCausalLM.from_quantized(
model_name,
device="cuda:0",
use_triton=False # Set True on Linux for speed
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Generate
prompt = "Explain quantum computing"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0]))
Quantize your own model
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from datasets import load_dataset
# Load model
model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Quantization config
quantize_config = BaseQuantizeConfig(
bits=4, # 4-bit quantization
group_size=128, # Group size (recommended: 128)
desc_act=False, # Activation order (False for CUDA kernel)
damp_percent=0.01 # Dampening factor
)
# Load model for quantization
model = AutoGPTQForCausalLM.from_pretrained(
model_name,
quantize_config=quantize_config
)
# Prepare calibration data
dataset = load_dataset("allenai/c4", "en", split="train", streaming=True)
calibration_data = [
    tokenizer(example["text"], truncation=True, max_length=512)  # keep tokenizer dicts (input_ids + attention_mask)
    for example in dataset.take(128)
]
# Quantize
model.quantize(calibration_data)
# Save quantized model
model.save_quantized("llama-2-7b-gptq")
tokenizer.save_pretrained("llama-2-7b-gptq")
# Push to HuggingFace
model.push_to_hub("username/llama-2-7b-gptq")
Group-wise quantization
How GPTQ works:
- Group weights: Divide each weight matrix into groups (typically 128 elements)
- Quantize per-group: Each group has its own scale/zero-point
- Minimize error: Uses Hessian information to minimize quantization error
- Result: 4-bit weights with near-FP16 accuracy
Group size trade-off:
| Group Size | Model Size | Accuracy | Speed | Recommendation |
|---|---|---|---|---|
| 32 | Largest (most scales/zero-points stored) | Best | Slowest | High accuracy needed |
| 128 | Medium | Good | Fast | Recommended default |
| 256 | Smaller | Lower | Faster | Speed/size critical |
| 1024 | Smallest (grouped) | Low | Fastest | Not recommended |
| -1 (per-column) | Smallest | Lowest | Fastest | Rarely worth it |
Example:
Weight matrix: [1024, 4096] = 4.2M elements
Group size = 128:
- Groups: 4.2M / 128 = 32,768 groups
- Each group stores its own scale and zero-point alongside the packed 4-bit weights
- Result: finer granularity → better accuracy (see the sketch below)
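The storage scheme can be sketched in a few lines of PyTorch. This is a plain per-group round-to-nearest quantizer for illustration only, not the GPTQ solver itself (GPTQ additionally uses second-order information to compensate rounding error column by column); the matrix shape matches the example above.
import torch

def quantize_groupwise(w: torch.Tensor, group_size: int = 128, bits: int = 4):
    """Illustrative per-group asymmetric quantization of a weight matrix.

    Each group of `group_size` weights shares one scale and one zero-point,
    which is exactly the metadata that grows as the group size shrinks.
    """
    out_features, in_features = w.shape
    qmax = 2 ** bits - 1
    groups = w.reshape(out_features, in_features // group_size, group_size)

    w_min = groups.min(dim=-1, keepdim=True).values
    w_max = groups.max(dim=-1, keepdim=True).values
    scale = (w_max - w_min).clamp(min=1e-8) / qmax
    zero = torch.round(-w_min / scale)

    q = torch.clamp(torch.round(groups / scale + zero), 0, qmax)  # 4-bit integer codes
    w_hat = (q - zero) * scale                                    # what the kernel reconstructs at runtime
    return q.reshape_as(w), scale, zero, w_hat.reshape_as(w)

w = torch.randn(1024, 4096)
q, scale, zero, w_hat = quantize_groupwise(w)
print("mean abs reconstruction error:", (w - w_hat).abs().mean().item())
Smaller groups mean more stored scales and zero-points (slightly larger model) but finer quantization ranges (better accuracy), which is the trade-off in the table above.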
Quantization configurations
Standard 4-bit (recommended)
from auto_gptq import BaseQuantizeConfig
config = BaseQuantizeConfig(
bits=4, # 4-bit quantization
group_size=128, # Standard group size
desc_act=False, # Faster CUDA kernel
damp_percent=0.01 # Dampening factor
)
Performance:
- Memory: 4× reduction (70B model: 140GB → 35GB)
- Accuracy: ~1.5% perplexity increase
- Speed: 3-4× faster than FP16
Higher compression (3-bit)
config = BaseQuantizeConfig(
bits=3, # 3-bit (more compression)
group_size=128, # Keep standard group size
desc_act=True, # Better accuracy (slower)
damp_percent=0.01
)
Trade-off:
- Memory: 5× reduction
- Accuracy: ~3% perplexity increase
- Speed: up to ~5× faster than FP16 (but noticeably less accurate)
Maximum accuracy (4-bit with small groups)
config = BaseQuantizeConfig(
bits=4,
group_size=32, # Smaller groups (better accuracy)
desc_act=True, # Activation reordering
damp_percent=0.005 # Lower dampening
)
Trade-off:
- Memory: 3.5× reduction (slightly larger)
- Accuracy: ~0.8% perplexity increase (best)
- Speed: 2-3× faster than FP16 (extra per-group metadata slows the kernel)
Kernel backends
ExLlamaV2 (default, fastest)
# Recent AutoGPTQ releases use the ExLlamaV2 kernel by default for 4-bit models;
# it can be toggled explicitly via disable_exllamav2
model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    device="cuda:0",
    disable_exllamav2=False  # keep the ExLlamaV2 kernel enabled
)
Performance: 1.5-2× faster than Triton
Marlin (Ampere+ GPUs)
# Marlin needs a symmetric 4-bit model quantized with desc_act=False
config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=False,  # Required for Marlin
    sym=True         # Required for Marlin
)
model.quantize(calibration_data)
# Load with the Marlin kernel (weights are repacked at load time)
model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    device="cuda:0",
    use_marlin=True  # ~2× faster on A100/H100
)
Requirements:
- NVIDIA Ampere or newer (A100, H100, RTX 30xx/40xx)
- Compute capability ≥ 8.0 (a quick check is sketched below)
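A small sketch for that check (compute capability 8.0 corresponds to Ampere):
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability {major}.{minor} -> Marlin supported: {(major, minor) >= (8, 0)}")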
Triton (Linux only)
model = AutoGPTQForCausalLM.from_quantized(
model_name,
device="cuda:0",
use_triton=True # Linux only
)
Performance: 1.2-1.5× faster than CUDA backend
Integration with transformers
Direct transformers usage
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load quantized model (transformers auto-detects GPTQ; requires optimum and auto-gptq to be installed)
model = AutoModelForCausalLM.from_pretrained(
"TheBloke/Llama-2-13B-Chat-GPTQ",
device_map="auto",
trust_remote_code=False
)
tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-13B-Chat-GPTQ")
# Use like any transformers model
inputs = tokenizer("Hello", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100)
QLoRA fine-tuning (GPTQ + LoRA)
from transformers import AutoModelForCausalLM
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model
# Load GPTQ model
model = AutoModelForCausalLM.from_pretrained(
"TheBloke/Llama-2-7B-GPTQ",
device_map="auto"
)
# Prepare for LoRA training
model = prepare_model_for_kbit_training(model)
# LoRA config
lora_config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=["q_proj", "v_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
# Add LoRA adapters
model = get_peft_model(model, lora_config)
# Fine-tune (memory efficient!)
# 70B model trainable on single A100 80GB
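To complete the picture, a minimal training-loop sketch using the standard transformers Trainer. Here `train_dataset` is a hypothetical tokenized dataset (data preparation is omitted), the tokenizer from the earlier sections is assumed to be loaded, and the hyperparameters are illustrative defaults rather than tuned values.
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers have no pad token by default

training_args = TrainingArguments(
    output_dir="llama-2-7b-gptq-lora",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    num_train_epochs=1,
    fp16=True,
    logging_steps=10,
)

trainer = Trainer(
    model=model,                  # GPTQ base model with LoRA adapters from above
    args=training_args,
    train_dataset=train_dataset,  # hypothetical tokenized dataset
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("llama-2-7b-gptq-lora")  # saves only the small LoRA adapter weights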
Performance benchmarks
Memory reduction
| Model | FP16 | GPTQ 4-bit | Reduction |
|---|---|---|---|
| Llama 2-7B | 14 GB | 3.5 GB | 4× |
| Llama 2-13B | 26 GB | 6.5 GB | 4× |
| Llama 2-70B | 140 GB | 35 GB | 4× |
| Llama 3.1-405B | 810 GB | ~203 GB | 4× |
Enables:
- 70B on single A100 80GB (vs 2× A100 needed for FP16)
- 405B on 3× A100 80GB (vs 11× A100 needed for FP16)
- 13B on RTX 4090 24GB (vs OOM with FP16)
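These figures follow from simple arithmetic. The sketch below is a rough weight-only estimator (it ignores KV cache, activations, and any layers kept in higher precision, so real deployments need extra headroom; the per-group overhead figure is an approximation):
def estimate_memory_gb(n_params: float, bits: int = 4, group_size: int = 128) -> float:
    """Rough weight-only memory estimate for a GPTQ-quantized model."""
    weight_bytes = n_params * bits / 8
    # roughly one fp16 scale + one packed zero-point per group (~2.5 bytes)
    overhead_bytes = (n_params / group_size) * 2.5
    return (weight_bytes + overhead_bytes) / 1e9

for n in (7e9, 13e9, 70e9, 405e9):
    print(f"{n / 1e9:.0f}B params: FP16 ≈ {n * 2 / 1e9:.0f} GB, GPTQ 4-bit ≈ {estimate_memory_gb(n):.1f} GB")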
Inference speed (Llama 2-7B, A100)
| Precision | Tokens/sec | vs FP16 |
|---|---|---|
| FP16 | 25 tok/s | 1× |
| GPTQ 4-bit (CUDA) | 85 tok/s | 3.4× |
| GPTQ 4-bit (ExLlama) | 105 tok/s | 4.2× |
| GPTQ 4-bit (Marlin) | 120 tok/s | 4.8× |
Accuracy (perplexity on WikiText-2)
| Model | FP16 | GPTQ 4-bit (g=128) | Degradation |
|---|---|---|---|
| Llama 2-7B | 5.47 | 5.55 | +1.5% |
| Llama 2-13B | 4.88 | 4.95 | +1.4% |
| Llama 2-70B | 3.32 | 3.38 | +1.8% |
Excellent quality preservation: less than 2% perplexity degradation.
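Numbers like these come from a standard chunked perplexity evaluation on the WikiText-2 test split. A minimal sketch is below, assuming `model` and `tokenizer` are already loaded as in the Quick start; exact values depend on context length, stride, and tokenizer details, so treat the table as indicative.
import torch
from datasets import load_dataset

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
encodings = tokenizer(text, return_tensors="pt")

max_len = 2048
nlls, n_tokens = [], 0
for begin in range(0, encodings.input_ids.size(1), max_len):
    input_ids = encodings.input_ids[:, begin:begin + max_len].to("cuda:0")
    if input_ids.size(1) < 2:
        break
    with torch.no_grad():
        loss = model(input_ids, labels=input_ids).loss  # mean next-token NLL over the chunk
    nlls.append(loss * input_ids.size(1))
    n_tokens += input_ids.size(1)

print("perplexity:", torch.exp(torch.stack(nlls).sum() / n_tokens).item())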
Common patterns
Multi-GPU deployment
# Automatic device mapping
model = AutoGPTQForCausalLM.from_quantized(
"TheBloke/Llama-2-70B-GPTQ",
device_map="auto", # Automatically split across GPUs
max_memory={0: "40GB", 1: "40GB"} # Limit per GPU
)
# Manual device mapping (accelerate expects one entry per module, so
# layer ranges are expanded with a comprehension rather than "0-39" keys)
device_map = {
    "model.embed_tokens": 0,
    "model.norm": 1,
    "lm_head": 1,
}
device_map.update({f"model.layers.{i}": 0 if i < 40 else 1 for i in range(80)})  # 80 layers in a 70B model
model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    device_map=device_map
)
CPU offloading
# Offload some layers to CPU (for models that do not fit in GPU memory alone)
model = AutoGPTQForCausalLM.from_quantized(
    "your-org/very-large-model-GPTQ",  # placeholder repo id for a 405B-class GPTQ checkpoint
    device_map="auto",
    max_memory={
        0: "80GB",        # GPU 0
        1: "80GB",        # GPU 1
        2: "80GB",        # GPU 2
        "cpu": "200GB"    # Offload overflow to CPU RAM (much slower)
    }
)
Batch inference
# Process multiple prompts efficiently
prompts = [
"Explain AI",
"Explain ML",
"Explain DL"
]
# Llama tokenizers ship without a pad token; reuse EOS and pad on the left for generation
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")
outputs = model.generate(
**inputs,
max_new_tokens=100,
pad_token_id=tokenizer.eos_token_id
)
for i, output in enumerate(outputs):
print(f"Prompt {i}: {tokenizer.decode(output)}")
Finding pre-quantized models
TheBloke on HuggingFace:
- https://huggingface.co/TheBloke
- 1000+ models in GPTQ format
- Multiple group sizes (32, 128)
- Both CUDA and Marlin formats
Search:
# Find GPTQ models on HuggingFace
https://huggingface.co/models?library=gptq
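The same search can be done programmatically with huggingface_hub (a small sketch; the query string is just an example):
from huggingface_hub import HfApi

# Ten most-downloaded repos matching "GPTQ" in their name or tags
for m in HfApi().list_models(search="GPTQ", sort="downloads", direction=-1, limit=10):
    print(m.id)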
Download:
from auto_gptq import AutoGPTQForCausalLM
# Automatically downloads from HuggingFace
model = AutoGPTQForCausalLM.from_quantized(
"TheBloke/Llama-2-70B-Chat-GPTQ",
device="cuda:0"
)
Supported models
- LLaMA family: Llama 2, Llama 3, Code Llama
- Mistral: Mistral 7B, Mixtral 8x7B, 8x22B
- Qwen: Qwen, Qwen2, QwQ
- DeepSeek: V2, V3
- Phi: Phi-2, Phi-3
- Yi, Falcon, BLOOM, OPT
- 100+ models on HuggingFace
References
- Calibration Guide - Dataset selection, quantization process, quality optimization
- Integration Guide - Transformers, PEFT, vLLM, TensorRT-LLM
- Troubleshooting - Common issues, performance optimization
Resources
- GitHub: https://github.com/AutoGPTQ/AutoGPTQ
- Paper: GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (arXiv:2210.17323)
- Models: https://huggingface.co/models?library=gptq
- Discord: https://discord.gg/autogptq