| name | huggingface-transformers |
| description | Hugging Face Transformers best practices including model loading, tokenization, fine-tuning workflows, and inference optimization. Use when working with transformer models, fine-tuning LLMs, implementing NLP tasks, or optimizing transformer inference. |
Hugging Face Transformers Best Practices
Comprehensive guide to using the Hugging Face Transformers library including model loading, tokenization, fine-tuning workflows, pipeline usage, custom datasets, and deployment optimization.
Quick Reference
When to use this skill:
- Loading and using pre-trained transformers (BERT, GPT, T5, LLaMA, etc.)
- Fine-tuning models on custom data
- Implementing NLP tasks (classification, QA, generation, etc.)
- Optimizing inference (quantization, ONNX, etc.)
- Debugging tokenization issues
- Using Hugging Face pipelines
- Deploying transformers to production
Models covered (a loading sketch for the text families follows this list):
- Encoders: BERT, RoBERTa, DeBERTa, ALBERT
- Decoders: GPT-2, GPT-Neo, LLaMA, Mistral
- Encoder-Decoders: T5, BART, Flan-T5
- Vision: ViT, CLIP (diffusion models such as Stable Diffusion live in the separate diffusers library)
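Each text family above pairs with a different Auto class. A minimal loading sketch; the checkpoints shown are interchangeable with any other checkpoint from the same family:
from transformers import (
    AutoModelForSequenceClassification,  # encoder-only heads (BERT, RoBERTa, DeBERTa, ALBERT)
    AutoModelForCausalLM,                # decoder-only generation (GPT-2, GPT-Neo, LLaMA, Mistral)
    AutoModelForSeq2SeqLM,               # encoder-decoder generation (T5, BART, Flan-T5)
)
encoder = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)
decoder = AutoModelForCausalLM.from_pretrained("gpt2")
seq2seq = AutoModelForSeq2SeqLM.from_pretrained("t5-small")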
Part 1: Model Loading Patterns
Pattern 1: Basic Model Loading
from transformers import AutoModel, AutoTokenizer
# Load model and tokenizer
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
# For specific tasks
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(
model_name,
num_labels=3 # For 3-class classification
)
Pattern 2: Loading with Specific Configuration
from transformers import AutoConfig, AutoModel
# Modify configuration
config = AutoConfig.from_pretrained("bert-base-uncased")
config.hidden_dropout_prob = 0.2 # Custom dropout
config.attention_probs_dropout_prob = 0.2
# Load model with custom config
model = AutoModel.from_pretrained("bert-base-uncased", config=config)
# Or create model from scratch with config
model = AutoModel.from_config(config)
Pattern 3: Loading Quantized Models (Memory Efficient)
from transformers import AutoModel, BitsAndBytesConfig
import torch
# 8-bit quantization (roughly halves weight memory vs fp16)
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModel.from_pretrained(
"meta-llama/Llama-2-7b-hf",
quantization_config=quantization_config,
device_map="auto" # Automatic device placement
)
# 4-bit quantization (roughly a quarter of fp16 weight memory)
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_quant_type="nf4",
bnb_4bit_use_double_quant=True
)
model = AutoModel.from_pretrained(
"meta-llama/Llama-2-13b-hf",
quantization_config=quantization_config,
device_map="auto"
)
Pattern 4: Loading from Local Path
# Save model locally
model.save_pretrained("./my-model")
tokenizer.save_pretrained("./my-model")
# Load from local path
model = AutoModel.from_pretrained("./my-model")
tokenizer = AutoTokenizer.from_pretrained("./my-model")
Part 2: Tokenization Best Practices
Critical Tokenization Patterns
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# ✅ CORRECT: All required arguments
tokens = tokenizer(
text,
padding=True, # Pad to longest in batch
truncation=True, # Truncate to max_length
max_length=512, # Maximum sequence length
return_tensors="pt" # Return PyTorch tensors
)
# Access components
input_ids = tokens['input_ids'] # Token IDs
attention_mask = tokens['attention_mask'] # Padding mask
token_type_ids = tokens.get('token_type_ids') # Segment IDs (BERT)
# ❌ WRONG: Missing critical arguments
tokens = tokenizer(text) # No padding, truncation, or tensor format!
Batch Tokenization
# Tokenize multiple texts efficiently
texts = ["First text", "Second text", "Third text"]
tokens = tokenizer(
texts,
padding=True, # Pad all to longest in batch
truncation=True,
max_length=128,
return_tensors="pt"
)
# Result shape: [batch_size, longest_sequence_in_batch]
print(tokens['input_ids'].shape) # torch.Size([3, max_len_in_batch])
Special Token Handling
# Add special tokens
tokenizer.add_special_tokens({
'additional_special_tokens': ['[CUSTOM]', '[MARKER]']
})
# Resize model embeddings to match
model.resize_token_embeddings(len(tokenizer))
# Encode with special tokens preserved
text = "Hello [CUSTOM] world"
tokens = tokenizer(text, add_special_tokens=True)
# Decode (input_ids here is a plain Python list of token ids, not a batch of tensors)
decoded = tokenizer.decode(tokens['input_ids'], skip_special_tokens=False)
Tokenization for Different Tasks
# Text classification (single sequence)
tokens = tokenizer(
"This movie was great!",
padding="max_length",
truncation=True,
max_length=128,
return_tensors="pt"
)
# Question answering (pair of sequences)
question = "What is the capital of France?"
context = "France is a country in Europe. Paris is its capital."
tokens = tokenizer(
question,
context,
padding="max_length",
truncation="only_second", # Only truncate context
max_length=384,
return_tensors="pt"
)
# Text generation (decoder-only models)
prompt = "Once upon a time"
tokens = tokenizer(prompt, return_tensors="pt")
# No padding needed for generation input
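The no-padding note above holds for a single prompt. When generating from a batch of prompts with a decoder-only model, padding is needed and must go on the left so each sequence ends in real tokens the model can continue from. A minimal sketch, assuming a GPT-2-style tokenizer:
from transformers import AutoTokenizer

gen_tokenizer = AutoTokenizer.from_pretrained("gpt2")
gen_tokenizer.padding_side = "left"                    # pad before the prompt, not after it
if gen_tokenizer.pad_token is None:
    gen_tokenizer.pad_token = gen_tokenizer.eos_token  # GPT-2 has no pad token by default

prompts = ["Once upon a time", "In a galaxy far away"]
batch = gen_tokenizer(prompts, padding=True, return_tensors="pt")
# model.generate(**batch, max_new_tokens=30) now continues each prompt correctly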
Part 3: Fine-Tuning Workflows
Pattern 1: Simple Fine-Tuning with Trainer
from transformers import (
AutoModelForSequenceClassification,
AutoTokenizer,
Trainer,
TrainingArguments
)
from datasets import load_dataset
# 1. Load dataset
dataset = load_dataset("glue", "mrpc")
# 2. Load model
model = AutoModelForSequenceClassification.from_pretrained(
"bert-base-uncased",
num_labels=2
)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# 3. Tokenize dataset
def tokenize_function(examples):
return tokenizer(
examples["sentence1"],
examples["sentence2"],
padding="max_length",
truncation=True,
max_length=128
)
tokenized_datasets = dataset.map(tokenize_function, batched=True)
# 4. Define training arguments
training_args = TrainingArguments(
output_dir="./results",
evaluation_strategy="epoch",
learning_rate=2e-5,
per_device_train_batch_size=16,
per_device_eval_batch_size=16,
num_train_epochs=3,
weight_decay=0.01,
logging_dir="./logs",
logging_steps=100,
save_strategy="epoch",
load_best_model_at_end=True,
metric_for_best_model="accuracy",
)
# 5. Define metrics
import evaluate  # datasets.load_metric is deprecated; use the standalone evaluate library
import numpy as np
metric = evaluate.load("accuracy")
def compute_metrics(eval_pred):
logits, labels = eval_pred
predictions = np.argmax(logits, axis=-1)
return metric.compute(predictions=predictions, references=labels)
# 6. Create Trainer
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_datasets["train"],
eval_dataset=tokenized_datasets["validation"],
compute_metrics=compute_metrics,
)
# 7. Train
trainer.train()
# 8. Save
trainer.save_model("./fine-tuned-model")
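After training, the same Trainer can report final metrics, and the saved directory reloads like any Hub checkpoint. A short sketch continuing the example above; saving the tokenizer alongside the weights is added so the directory is self-contained:
# 9. Evaluate and reload
metrics = trainer.evaluate()   # runs eval_dataset through compute_metrics
print(metrics)                 # e.g. {'eval_accuracy': ..., 'eval_loss': ...}

tokenizer.save_pretrained("./fine-tuned-model")  # keep the tokenizer with the weights

from transformers import AutoModelForSequenceClassification, AutoTokenizer
reloaded_model = AutoModelForSequenceClassification.from_pretrained("./fine-tuned-model")
reloaded_tokenizer = AutoTokenizer.from_pretrained("./fine-tuned-model")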
Pattern 2: LoRA Fine-Tuning (Parameter-Efficient)
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType
# Load base model in 8-bit (pass a BitsAndBytesConfig; the bare load_in_8bit flag is deprecated)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto"
)
# Prepare the quantized model for training (casts norms, enables input gradients)
model = prepare_model_for_kbit_training(model)
# Configure LoRA
lora_config = LoraConfig(
task_type=TaskType.CAUSAL_LM,
r=8, # LoRA rank
lora_alpha=32, # LoRA alpha
lora_dropout=0.1,
target_modules=["q_proj", "v_proj"], # Which layers to adapt
)
# Apply LoRA
model = get_peft_model(model, lora_config)
# Check trainable parameters
model.print_trainable_parameters()
# Output: trainable params: 4.2M || all params: 6.7B || trainable%: 0.062%
# Train with Trainer (same as before)
# Only LoRA parameters are updated!
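After training, only the small adapter weights need to be saved; they can later be re-attached to the base model or merged into it. A short sketch using the PEFT API; the ./lora-adapter path is illustrative:
# Save only the LoRA adapter weights (a few megabytes, not the full model)
model.save_pretrained("./lora-adapter")

# Later: reload the base model and attach the adapter
from peft import PeftModel
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", device_map="auto")
model = PeftModel.from_pretrained(base, "./lora-adapter")

# Optionally merge the adapter into the base weights for adapter-free inference
model = model.merge_and_unload()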
Pattern 3: Custom Training Loop
import torch
from torch.utils.data import DataLoader
from torch.optim import AdamW  # transformers.AdamW is deprecated; use torch's implementation
from transformers import get_scheduler
# Prepare dataloaders (the dataset must return PyTorch tensors containing only model inputs)
from transformers import DataCollatorWithPadding
tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")
collator = DataCollatorWithPadding(tokenizer)
train_dataloader = DataLoader(tokenized_datasets["train"], batch_size=16, shuffle=True, collate_fn=collator)
eval_dataloader = DataLoader(tokenized_datasets["validation"], batch_size=16, collate_fn=collator)
# Optimizer
optimizer = AdamW(model.parameters(), lr=2e-5)
# Learning rate scheduler
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
"linear",
optimizer=optimizer,
num_warmup_steps=500,
num_training_steps=num_training_steps
)
# Training loop
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
for epoch in range(num_epochs):
model.train()
for batch in train_dataloader:
batch = {k: v.to(device) for k, v in batch.items()}
outputs = model(**batch)
loss = outputs.loss
loss.backward()
optimizer.step()
lr_scheduler.step()
optimizer.zero_grad()
# Evaluation
model.eval()
for batch in eval_dataloader:
batch = {k: v.to(device) for k, v in batch.items()}
with torch.no_grad():
outputs = model(**batch)
# Compute metrics
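To turn the eval loop's logits into a score, accumulate predictions batch by batch and compute the metric once at the end. A minimal sketch using the evaluate library's accuracy metric (pip install evaluate):
import evaluate

metric = evaluate.load("accuracy")
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)
    predictions = outputs.logits.argmax(dim=-1)
    metric.add_batch(predictions=predictions.cpu(), references=batch["labels"].cpu())
print(metric.compute())  # e.g. {'accuracy': 0.86}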
Part 4: Pipeline Usage (High-Level API)
Text Classification Pipeline
from transformers import pipeline
# Load pipeline
classifier = pipeline(
"text-classification",
model="distilbert-base-uncased-finetuned-sst-2-english"
)
# Single prediction
result = classifier("I love this product!")
# [{'label': 'POSITIVE', 'score': 0.9998}]
# Batch prediction
results = classifier([
"Great service!",
"Terrible experience",
"Average quality"
])
Question Answering Pipeline
qa_pipeline = pipeline("question-answering", model="distilbert-base-uncased-distilled-squad")
result = qa_pipeline(
question="What is the capital of France?",
context="France is a country in Europe. Its capital is Paris, a beautiful city."
)
# e.g. {'score': 0.98, 'start': 46, 'end': 51, 'answer': 'Paris'}
Text Generation Pipeline
generator = pipeline("text-generation", model="gpt2")
outputs = generator(
"Once upon a time",
max_length=50,
num_return_sequences=3,
temperature=0.7,
top_k=50,
top_p=0.95,
do_sample=True
)
for output in outputs:
print(output['generated_text'])
Zero-Shot Classification Pipeline
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
result = classifier(
"This is a course about Python programming.",
candidate_labels=["education", "technology", "business", "sports"]
)
# {'sequence': '...', 'labels': ['education', 'technology', ...], 'scores': [0.85, 0.12, ...]}
Part 5: Inference Optimization
Optimization 1: Batch Processing
import torch
# ❌ SLOW: Process one text at a time
for text in texts:
    output = model(**tokenizer(text, return_tensors="pt"))
# ✅ FAST: Process in batches, with gradient tracking disabled for inference
model.eval()
batch_size = 32
for i in range(0, len(texts), batch_size):
    batch = texts[i:i+batch_size]
    inputs = tokenizer(batch, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
Optimization 2: Mixed Precision (AMP)
from torch.cuda.amp import autocast, GradScaler
scaler = GradScaler()
for batch in dataloader:
optimizer.zero_grad()
# Forward pass in mixed precision
with autocast():
outputs = model(**batch)
loss = outputs.loss
# Backward pass with scaled gradients
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
Optimization 3: ONNX Export
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSequenceClassification
# Export to ONNX at load time (requires `pip install optimum[onnxruntime]`)
ort_model = ORTModelForSequenceClassification.from_pretrained("bert-base-uncased", export=True)
ort_model.save_pretrained("./onnx-model")
# Reload the exported ONNX model later for faster inference via ONNX Runtime
ort_model = ORTModelForSequenceClassification.from_pretrained("./onnx-model")
# Inference (2-3x faster)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
inputs = tokenizer("Hello world", return_tensors="pt")
outputs = ort_model(**inputs)
Optimization 4: Dynamic Quantization
import torch
# Quantize model to int8
quantized_model = torch.quantization.quantize_dynamic(
model,
{torch.nn.Linear}, # Quantize Linear layers
dtype=torch.qint8
)
# Linear weights stored in int8 (~4x smaller); CPU inference speedup varies by model and workload
Part 6: Common Issues & Solutions
Issue 1: CUDA Out of Memory
Problem: RuntimeError: CUDA out of memory
Solutions:
# Solution 1: Reduce batch size
training_args = TrainingArguments(
per_device_train_batch_size=8, # Was 32
gradient_accumulation_steps=4, # Effective batch = 8*4 = 32
)
# Solution 2: Use gradient checkpointing
model.gradient_checkpointing_enable()
# Solution 3: Use 8-bit model
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModel.from_pretrained("model-name", quantization_config=quantization_config)
# Solution 4: Clear cache
import torch
torch.cuda.empty_cache()
Issue 2: Slow Tokenization
Problem: Tokenization is the bottleneck
Solutions:
# Solution 1: Use fast tokenizers
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
# Solution 2: Tokenize dataset once, cache it
tokenized_dataset = dataset.map(
tokenize_function,
batched=True,
num_proc=4, # Parallel processing
remove_columns=dataset.column_names,
load_from_cache_file=True # Cache results
)
# Solution 3: Tokenize many texts in a single call
# (batched=True / batch_size are Dataset.map arguments, not tokenizer arguments)
tokens = tokenizer(
    texts,              # pass the whole list at once; fast tokenizers batch internally
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt"
)
Issue 3: Inconsistent Results
Problem: Model outputs different results for same input
Solution:
# Set seeds for reproducibility
import random
import numpy as np
import torch
def set_seed(seed=42):
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
set_seed(42)
# Disable dropout during inference
model.eval()
# Use deterministic generation (greedy decoding gives the same output every run)
outputs = model.generate(
    **inputs,
    do_sample=False
)
# OR keep sampling but seed PyTorch first; generate() itself has no seed argument
torch.manual_seed(42)
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=1.0,
    top_k=50
)
Issue 4: Attention Mask Errors
Problem: IndexError: index out of range in self
Solution:
# ✅ ALWAYS provide attention mask
tokens = tokenizer(
text,
padding=True,
truncation=True,
return_tensors="pt",
return_attention_mask=True # Explicit (usually default)
)
# Use it in model forward
outputs = model(
input_ids=tokens['input_ids'],
attention_mask=tokens['attention_mask'] # Don't forget this!
)
# For custom padding
attention_mask = (input_ids != tokenizer.pad_token_id).long()
Part 7: Model-Specific Patterns
GPT Models (Decoder-Only)
from transformers import GPT2LMHeadModel, GPT2Tokenizer
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# Set pad token (GPT doesn't have one by default)
tokenizer.pad_token = tokenizer.eos_token
# Generation
input_text = "The future of AI is"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    num_beams=5,              # Beam search (deterministic)
    early_stopping=True,
    no_repeat_ngram_size=2    # Prevent repetition
)
# For sampled output, set do_sample=True with temperature / top_p instead of beams;
# without do_sample=True, temperature and top_p are ignored.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
T5 Models (Encoder-Decoder)
from transformers import T5ForConditionalGeneration, T5Tokenizer
model = T5ForConditionalGeneration.from_pretrained("t5-small")
tokenizer = T5Tokenizer.from_pretrained("t5-small")
# T5 expects task prefix
input_text = "translate English to German: How are you?"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(
**inputs,
max_length=50
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# "Wie geht es dir?"
BERT Models (Encoder-Only)
import torch
from transformers import BertForMaskedLM, BertTokenizer
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Masked language modeling
text = "Paris is the [MASK] of France."
inputs = tokenizer(text, return_tensors="pt")
# Get predictions for [MASK]
outputs = model(**inputs)
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits = outputs.logits[0, mask_token_index, :]
# Top 5 predictions
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()
for token in top_5_tokens:
print(tokenizer.decode([token]))
# capital, city, center, heart, ...
Part 8: Production Deployment
FastAPI Serving Pattern
from fastapi import FastAPI
from transformers import pipeline
from pydantic import BaseModel
import uvicorn
app = FastAPI()
# Load model once at startup
classifier = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")
class TextInput(BaseModel):
text: str
@app.post("/classify")
async def classify_text(input: TextInput):
result = classifier(input.text)[0]
return {
"label": result['label'],
"confidence": result['score']
}
if __name__ == "__main__":
uvicorn.run(app, host="0.0.0.0", port=8000)
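A quick client-side check once the server is running; the requests call below is illustrative:
import requests

resp = requests.post(
    "http://localhost:8000/classify",
    json={"text": "I love this product!"}
)
print(resp.json())  # e.g. {'label': 'POSITIVE', 'confidence': 0.9998}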
Batch Inference Optimization
import asyncio
import torch

class BatchPredictor:
    def __init__(self, model, tokenizer, max_batch_size=32):
        self.model = model
        self.tokenizer = tokenizer
        self.max_batch_size = max_batch_size
        self.queue = []
        self.lock = asyncio.Lock()

    async def predict(self, text: str):
        async with self.lock:
            future = asyncio.Future()
            self.queue.append((text, future))
            if len(self.queue) >= self.max_batch_size:
                await self._process_batch()
        # Await outside the lock so other requests can keep queueing.
        # A production version should also flush partial batches on a short timer,
        # otherwise requests below max_batch_size wait indefinitely.
        return await future

    async def _process_batch(self):
        if not self.queue:
            return
        texts, futures = zip(*self.queue)
        self.queue = []
        # Run the whole batch through the model (inference only, no gradients)
        inputs = self.tokenizer(list(texts), padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            outputs = self.model(**inputs)
        results = outputs.logits.argmax(dim=-1).tolist()
        # Resolve each caller's future with its predicted class id
        for future, result in zip(futures, results):
            future.set_result(result)
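A quick way to exercise the class; the example submits exactly one full batch so the queue flushes (see the flush note in the comments above). It assumes a sequence-classification model and tokenizer have already been loaded:
async def main():
    predictor = BatchPredictor(model, tokenizer, max_batch_size=4)
    texts = ["Great service!", "Terrible experience", "Average quality", "Loved it"]
    # Submit one batch worth of concurrent requests
    label_ids = await asyncio.gather(*(predictor.predict(t) for t in texts))
    print(label_ids)  # predicted class ids, one per input text

asyncio.run(main())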
Quick Decision Trees
"Which model should I use?"
Task type?
Classification → BERT, RoBERTa, DeBERTa
Generation → GPT-2, GPT-Neo, LLaMA
Translation/Summarization → T5, BART, mT5
Question Answering → BERT, DeBERTa, RoBERTa
Performance vs Speed?
Best performance → Large models (355M+ params)
Balanced → Base models (110M params)
Fast inference → Distilled models (66M params)
"How should I fine-tune?"
Have full dataset control?
YES → Full fine-tuning or LoRA
NO → Few-shot prompting
Dataset size?
Large (>10K examples) → Full fine-tuning
Medium (1K-10K) → LoRA or full fine-tuning
Small (<1K) → LoRA or prompt engineering
Compute available?
Limited → LoRA on a 4-bit quantized base (see the sketch after this list)
Moderate → LoRA (8-bit)
High → Full fine-tuning
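For the limited-compute branch, a typical setup combines 4-bit loading with LoRA adapters (QLoRA-style). A minimal sketch under those assumptions; the target_modules list depends on the architecture:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(
    task_type=TaskType.CAUSAL_LM, r=8, lora_alpha=32,
    lora_dropout=0.05, target_modules=["q_proj", "v_proj"],
))
model.print_trainable_parameters()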
Resources
- Hugging Face Docs: https://huggingface.co/docs/transformers/
- Model Hub: https://huggingface.co/models
- PEFT (LoRA): https://huggingface.co/docs/peft/
- Optimum: https://huggingface.co/docs/optimum/
- Datasets: https://huggingface.co/docs/datasets/
Skill version: 1.0.0 Last updated: 2025-10-25 Maintained by: Applied Artificial Intelligence