| name | peft |
| description | Parameter-efficient fine-tuning with LoRA and Unsloth. Covers LoraConfig, target module selection, QLoRA for 4-bit training, adapter merging, and Unsloth optimizations for 2x faster training. |
Parameter-Efficient Fine-Tuning (PEFT)
Overview
PEFT methods like LoRA train only a small set of adapter parameters instead of all model weights. Trainable parameters typically drop below 1% of the total, which slashes optimizer and gradient memory while keeping quality close to full fine-tuning.
Quick Reference
| Method | Memory | Speed | Quality |
|---|---|---|---|
| Full Fine-tune | High | Slow | Best |
| LoRA | Low | Fast | Very Good |
| QLoRA | Very Low | Fast | Good |
| Unsloth | Very Low | 2x Faster | Good |
LoRA Concepts
How LoRA Works
Original weight matrix W (frozen): d x k
LoRA adapters A and B: d x r, r x k (where r << min(d,k))
Forward pass:
output = x @ W + x @ A @ B * (alpha / r)
Trainable params: r * (d + k) (instead of d * k)
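For intuition, here is a minimal PyTorch sketch of that forward pass. The LoRALinear class, dimensions, and init values are illustrative only (this is not the peft implementation); note that B starts at zero so the adapter contributes nothing before training.
import torch
import torch.nn as nn
class LoRALinear(nn.Module):
    def __init__(self, d, k, r=8, alpha=16):
        super().__init__()
        self.W = nn.Linear(d, k, bias=False)             # frozen base weight (d -> k)
        self.W.weight.requires_grad = False
        self.A = nn.Parameter(torch.randn(d, r) * 0.01)  # d x r
        self.B = nn.Parameter(torch.zeros(r, k))         # r x k, zero init => no change at start
        self.scale = alpha / r
    def forward(self, x):
        return self.W(x) + (x @ self.A @ self.B) * self.scale
layer = LoRALinear(d=4096, k=4096, r=8)
print(layer(torch.randn(2, 4096)).shape)  # torch.Size([2, 4096])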
Memory Savings
def lora_savings(d, k, r):
    """Percentage reduction in trainable parameters for a rank-r adapter."""
    original = d * k       # parameters in the full weight matrix
    lora = r * (d + k)     # parameters in A (d x r) plus B (r x k)
    reduction = (1 - lora / original) * 100
    return reduction
# Example: 4096 x 4096 matrix with rank 8
print(f"Memory reduction: {lora_savings(4096, 4096, 8):.1f}%")
# Output: ~99.6% reduction
Basic LoRA Setup
Configure LoRA
from peft import LoraConfig, get_peft_model, TaskType
lora_config = LoraConfig(
r=8, # Rank (capacity)
lora_alpha=16, # Scaling factor
target_modules=["q_proj", "v_proj"], # Which layers
lora_dropout=0.05, # Regularization
bias="none", # Don't train biases
task_type=TaskType.CAUSAL_LM # Task type
)
Apply to Model
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
"TinyLlama/TinyLlama-1.1B-Chat-v1.0",
device_map="auto"
)
model = get_peft_model(model, lora_config)
# Check trainable parameters
model.print_trainable_parameters()
# Output: trainable params: 4,194,304 || all params: 1,100,048,384 || trainable%: 0.38%
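If you want to verify those numbers yourself, the same counts can be derived from named_parameters(); only the injected lora_A / lora_B weights should require gradients.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")
print([n for n, p in model.named_parameters() if p.requires_grad][:4])  # lora_A / lora_B entries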
LoRA Parameters
Key Parameters
| Parameter | Values | Effect |
|---|---|---|
| r | 4, 8, 16, 32 | Adapter capacity |
| lora_alpha | r to 2*r | Scaling (higher = stronger) |
| target_modules | List | Which layers to adapt |
| lora_dropout | 0.0-0.1 | Regularization |
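Because the adapter output is multiplied by lora_alpha / r (see the forward pass above), the two are usually tuned together; a quick illustration with hypothetical value pairs:
for r, alpha in [(8, 16), (16, 32), (16, 16)]:
    print(f"r={r:<3} lora_alpha={alpha:<3} effective scale = {alpha / r}")
# (8, 16) and (16, 32) keep the same scale of 2.0; (16, 16) halves it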
Target Modules
# Common target modules for different models
# LLaMA / Mistral / TinyLlama
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]
# GPT-2
target_modules = ["c_attn", "c_proj"]
# BLOOM
target_modules = ["query_key_value", "dense"]
# All linear layers (most aggressive)
target_modules = "all-linear"
Rank Selection Guide
| Rank (r) | Use Case |
|---|---|
| 4 | Simple tasks, small datasets |
| 8 | General purpose (recommended) |
| 16 | Complex tasks, more capacity |
| 32+ | Near full fine-tune quality |
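To make the rank guide concrete, here is the adapter parameter count per rank for a single projection matrix; the 2048 x 2048 size is an assumption, roughly a TinyLlama attention projection.
d = k = 2048  # assumed projection size
for r in [4, 8, 16, 32]:
    adapter = r * (d + k)
    print(f"r={r:<3} adapter params: {adapter:,} ({100 * adapter / (d * k):.2f}% of the full matrix)")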
QLoRA (Quantized LoRA)
Setup
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch
# 4-bit quantization config
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True
)
# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
"TinyLlama/TinyLlama-1.1B-Chat-v1.0",
quantization_config=quantization_config,
device_map="auto"
)
# Prepare for k-bit training (important!)
model = prepare_model_for_kbit_training(model)
# Add LoRA adapters
lora_config = LoraConfig(
r=8,
lora_alpha=16,
target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
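As a sanity check that quantization took effect, you can print the weight footprint. get_memory_footprint() is a standard transformers method; on a PeftModel the call is forwarded to the wrapped base model.
print(f"Weight footprint: {model.get_memory_footprint() / 1e9:.2f} GB")
# A 1.1B model in 4-bit should sit well under the ~2.2 GB an fp16 copy would need;
# activations, gradients, and adapter optimizer states come on top of this.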
Training with PEFT
Using SFTTrainer
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
dataset = load_dataset("timdettmers/openassistant-guanaco")
sft_config = SFTConfig(
output_dir="./lora_checkpoints",
num_train_epochs=3,
per_device_train_batch_size=4,
learning_rate=2e-4, # Higher LR for LoRA
logging_steps=10,
save_steps=500,
max_seq_length=512,
gradient_accumulation_steps=4,
)
trainer = SFTTrainer(
model=model,
args=sft_config,
train_dataset=dataset["train"],
tokenizer=tokenizer,
dataset_text_field="text",
peft_config=lora_config,  # Pass LoRA config (omit if the model is already a PeftModel)
)
trainer.train()
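Note that the optimizer only steps every gradient_accumulation_steps batches, so the effective batch size above is per_device_train_batch_size x gradient_accumulation_steps:
effective_batch = 4 * 4  # per_device_train_batch_size * gradient_accumulation_steps
print(f"Effective batch size per device: {effective_batch}")
# After training, save the adapters with trainer.save_model() or model.save_pretrained()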
Unsloth (2x Faster Training)
Setup
from unsloth import FastLanguageModel
# Load model with Unsloth optimizations
model, tokenizer = FastLanguageModel.from_pretrained(
"unsloth/tinyllama-chat-bnb-4bit", # Pre-quantized
max_seq_length=2048,
dtype=None, # Auto-detect
load_in_4bit=True,
)
# Add LoRA with Unsloth
model = FastLanguageModel.get_peft_model(
model,
r=16,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
lora_alpha=16,
lora_dropout=0,
bias="none",
use_gradient_checkpointing=True,
random_state=42,
)
Train with Unsloth
import torch
from trl import SFTTrainer, SFTConfig
sft_config = SFTConfig(
output_dir="./unsloth_output",
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
warmup_steps=5,
max_steps=100,
learning_rate=2e-4,
fp16=not torch.cuda.is_bf16_supported(),
bf16=torch.cuda.is_bf16_supported(),
logging_steps=1,
optim="adamw_8bit", # Memory-efficient optimizer
weight_decay=0.01,
lr_scheduler_type="linear",
seed=42,
)
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=dataset,
dataset_text_field="text",
max_seq_length=2048,
args=sft_config,
)
trainer.train()
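After training, Unsloth can switch the model into its faster generation path with FastLanguageModel.for_inference; the prompt and token budget below are just examples.
FastLanguageModel.for_inference(model)  # enable Unsloth's optimized inference mode
inputs = tokenizer("What is Python?", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))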
Save and Load Adapters
Save Adapters Only
# Save just the LoRA weights (small!)
model.save_pretrained("./lora_adapters")
Load Adapters
from peft import PeftModel
base_model = AutoModelForCausalLM.from_pretrained(
"TinyLlama/TinyLlama-1.1B-Chat-v1.0",
device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "./lora_adapters")
Merge Adapters into Base Model
# Merge LoRA weights into base model (for deployment)
merged_model = model.merge_and_unload()
# Save merged model
merged_model.save_pretrained("./merged_model")
Inference with Adapters
from peft import PeftModel
# Load base + adapters
base_model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
model = PeftModel.from_pretrained(base_model, "./lora_adapters")
# Generate
model.eval()
inputs = tokenizer("What is Python?", return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0]))
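To compare against the base model without reloading anything, PeftModel can temporarily switch the adapters off; disable_adapter() is a context manager in peft.
with model.disable_adapter():
    base_outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(base_outputs[0]))  # base model's answer, adapters ignored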
Comparison: Full vs LoRA vs QLoRA
| Aspect | Full Fine-tune | LoRA | QLoRA |
|---|---|---|---|
| Trainable % | 100% | ~0.1-1% | ~0.1-1% |
| Memory | 4x model | ~1.2x model | ~0.5x model |
| Training speed | Slow | Fast | Fast |
| Quality | Best | Very Good | Good |
| 7B model | 28GB+ | ~16GB | ~6GB |
Troubleshooting
Out of Memory
Fix:
# Use gradient checkpointing
model.gradient_checkpointing_enable()
# Use smaller batch with accumulation
per_device_train_batch_size=1
gradient_accumulation_steps=8
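Two more levers that usually help are a paged 8-bit optimizer and a shorter sequence length; a sketch of a memory-leaning config (values are illustrative):
sft_config = SFTConfig(
    output_dir="./lora_checkpoints",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    gradient_checkpointing=True,
    optim="paged_adamw_8bit",   # pages optimizer state to CPU under memory pressure
    max_seq_length=256,         # shorter sequences cut activation memory
)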
Poor Quality
Fix:
- Increase r (rank)
- Add more target modules
- Train longer
- Check data quality
NaN Loss
Fix:
- Lower learning rate
- Use gradient clipping (see the sketch below)
- Check for data issues
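The learning-rate and clipping fixes map directly onto training-config fields; a sketch with illustrative values:
sft_config = SFTConfig(
    output_dir="./lora_checkpoints",
    learning_rate=5e-5,   # lower than the usual 2e-4 LoRA starting point
    max_grad_norm=0.3,    # gradient clipping threshold
    bf16=True,            # bf16 is generally more numerically stable than fp16
)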
When to Use This Skill
Use when:
- GPU memory is limited
- Fine-tuning large models (7B+)
- Need fast training iterations
- Want to swap adapters for different tasks
Cross-References
- bazzite-ai-jupyter:finetuning - Full fine-tuning basics
- bazzite-ai-jupyter:quantization - Quantization for QLoRA
- bazzite-ai-jupyter:transformers - Target module selection