
SKILL.md

name: peft
description: Parameter-efficient fine-tuning with LoRA and Unsloth. Covers LoraConfig, target module selection, QLoRA for 4-bit training, adapter merging, and Unsloth optimizations for 2x faster training.

Parameter-Efficient Fine-Tuning (PEFT)

Overview

PEFT methods like LoRA freeze the base model and train only a small set of adapter parameters (often well under 1% of the total), slashing gradient and optimizer memory while retaining most of the quality of full fine-tuning.

Quick Reference

| Method         | Memory   | Speed     | Quality   |
|----------------|----------|-----------|-----------|
| Full fine-tune | High     | Slow      | Best      |
| LoRA           | Low      | Fast      | Very good |
| QLoRA          | Very low | Fast      | Good      |
| Unsloth        | Very low | 2x faster | Good      |

LoRA Concepts

How LoRA Works

Original weight matrix W (frozen):     d x k
LoRA adapters A and B:                 A is d x r, B is r x k (where r << min(d, k))

Forward pass:
  output = x @ W + (x @ A @ B) * (alpha / r)

Trainable params: r * (d + k)  instead of d * k  (equals 2*r*d when d = k)
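
To make the shapes concrete, here is a minimal PyTorch sketch of the mechanism, matching the notation above. This is illustrative only; the peft library's internal implementation differs:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer W plus a trainable low-rank update A @ B."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # freeze W
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        d, k = base.in_features, base.out_features
        self.A = nn.Parameter(torch.randn(d, r) * 0.01)  # d x r, small random init
        self.B = nn.Parameter(torch.zeros(r, k))         # r x k, zero init: starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        # output = x @ W (frozen) + x @ A @ B * (alpha / r)
        return self.base(x) + (x @ self.A @ self.B) * self.scale

# Example: wrap one 4096 x 4096 projection with rank 8
layer = LoRALinear(nn.Linear(4096, 4096), r=8)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))
# 65536 = 8 * (4096 + 4096)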

Memory Savings

def lora_savings(d, k, r):
    """Percent reduction in trainable parameters from using LoRA."""
    original = d * k
    lora = r * (d + k)  # A is d x r, B is r x k
    reduction = (1 - lora / original) * 100
    return reduction

# Example: 4096 x 4096 matrix with rank 8
print(f"Memory reduction: {lora_savings(4096, 4096, 8):.1f}%")
# Output: ~99.6% reduction

Basic LoRA Setup

Configure LoRA

from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=8,                          # Rank (capacity)
    lora_alpha=16,                # Scaling factor
    target_modules=["q_proj", "v_proj"],  # Which layers
    lora_dropout=0.05,            # Regularization
    bias="none",                  # Don't train biases
    task_type=TaskType.CAUSAL_LM  # Task type
)

Apply to Model

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    device_map="auto"
)

model = get_peft_model(model, lora_config)

# Check trainable parameters
model.print_trainable_parameters()
# Example output (exact counts depend on the model and target modules):
# trainable params: ~1.1M || all params: ~1.1B || trainable%: ~0.1
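
To confirm the adapters landed where you expect, you can list the parameters peft injected (the name printed below is illustrative of peft's naming scheme):

lora_params = [n for n, _ in model.named_parameters() if "lora_" in n]
print(len(lora_params), "adapter tensors, e.g.", lora_params[0])
# e.g. base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight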

LoRA Parameters

Key Parameters

| Parameter      | Typical values      | Effect                              |
|----------------|---------------------|-------------------------------------|
| r              | 4, 8, 16, 32        | Adapter capacity                    |
| lora_alpha     | r to 2*r            | Scaling (higher = stronger updates) |
| target_modules | list of layer names | Which layers to adapt               |
| lora_dropout   | 0.0-0.1             | Regularization                      |

Target Modules

# Common target modules for different models

# LLaMA / Mistral / TinyLlama
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]

# GPT-2
target_modules = ["c_attn", "c_proj"]

# BLOOM
target_modules = ["query_key_value", "dense"]

# All linear layers (most aggressive)
target_modules = "all-linear"
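
If you're unsure which names apply to your architecture, you can discover them by walking the model; a quick sketch using standard PyTorch APIs:

import torch.nn as nn

# Unique leaf names of every Linear layer in the model
linear_names = {name.rsplit(".", 1)[-1]
                for name, module in model.named_modules()
                if isinstance(module, nn.Linear)}
print(sorted(linear_names))
# For LLaMA-style models this typically includes q_proj, k_proj, v_proj,
# o_proj, gate_proj, up_proj, down_proj, and lm_head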

Rank Selection Guide

| Rank (r) | Use case                          |
|----------|-----------------------------------|
| 4        | Simple tasks, small datasets      |
| 8        | General purpose (recommended)     |
| 16       | Complex tasks, more capacity      |
| 32+      | Approaches full fine-tune quality |
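
Adapter size grows linearly with r, so a back-of-envelope estimate helps before training. The dimensions below are assumptions for a LLaMA-7B-style model (4096-wide projections, q/k/v/o across 32 layers), not measurements:

d = k = 4096         # assumed projection width
n_modules = 4 * 32   # assumed: q/k/v/o_proj across 32 layers

for r in (4, 8, 16, 32):
    params = r * (d + k) * n_modules
    print(f"r={r:>2}: {params / 1e6:5.1f}M trainable params")
# Doubling r doubles adapter size; quality gains often flatten past r=16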

QLoRA (Quantized LoRA)

Setup

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch

# 4-bit quantization config
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    quantization_config=quantization_config,
    device_map="auto"
)

# Prepare for k-bit training (important!)
model = prepare_model_for_kbit_training(model)

# Add LoRA adapters
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
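
As a sanity check that quantization took effect, compare the model's weight footprint against a full-precision load (get_memory_footprint is a standard transformers model method):

# Weight memory in GB; an NF4 load should be roughly a quarter of fp16
print(f"Model footprint: {model.get_memory_footprint() / 1e9:.2f} GB")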

Training with PEFT

Using SFTTrainer

from transformers import AutoTokenizer
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

dataset = load_dataset("timdettmers/openassistant-guanaco")
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

sft_config = SFTConfig(
    output_dir="./lora_checkpoints",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-4,  # LoRA tolerates a higher LR than full fine-tuning
    logging_steps=10,
    save_steps=500,
    max_seq_length=512,
    gradient_accumulation_steps=4,
    dataset_text_field="text",  # dataset column containing the training text
)

trainer = SFTTrainer(
    model=model,              # pass the base model; SFTTrainer applies LoRA from peft_config
    args=sft_config,
    train_dataset=dataset["train"],
    tokenizer=tokenizer,
    peft_config=lora_config,
)

trainer.train()

Unsloth (2x Faster Training)

Setup

from unsloth import FastLanguageModel

# Load model with Unsloth optimizations
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/tinyllama-chat-bnb-4bit",  # Pre-quantized
    max_seq_length=2048,
    dtype=None,  # Auto-detect
    load_in_4bit=True,
)

# Add LoRA with Unsloth
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing=True,
    random_state=42,
)

Train with Unsloth

import torch
from trl import SFTTrainer, SFTConfig

sft_config = SFTConfig(
    output_dir="./unsloth_output",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    warmup_steps=5,
    max_steps=100,
    learning_rate=2e-4,
    fp16=not torch.cuda.is_bf16_supported(),
    bf16=torch.cuda.is_bf16_supported(),
    logging_steps=1,
    optim="adamw_8bit",  # memory-efficient optimizer
    weight_decay=0.01,
    lr_scheduler_type="linear",
    seed=42,
    max_seq_length=2048,
    dataset_text_field="text",
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset["train"],  # dataset loaded in the SFTTrainer section above
    args=sft_config,
)

trainer.train()
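
After training, Unsloth provides a switch that flips the model into its faster generation path; a brief sketch:

# Enable Unsloth's optimized inference mode
FastLanguageModel.for_inference(model)

inputs = tokenizer("What is Python?", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))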

Save and Load Adapters

Save Adapters Only

# Save just the LoRA weights (small!)
model.save_pretrained("./lora_adapters")
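
Because the adapter checkpoint is only a few megabytes, sharing it is cheap; the standard push_to_hub works on PEFT models (the repo id below is a placeholder):

# Requires `huggingface-cli login` first; repo id is a placeholder
model.push_to_hub("your-username/tinyllama-lora-adapters")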

Load Adapters

from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    device_map="auto"
)

model = PeftModel.from_pretrained(base_model, "./lora_adapters")

Merge Adapters into Base Model

# Merge LoRA weights into base model (for deployment)
merged_model = model.merge_and_unload()

# Save merged model
merged_model.save_pretrained("./merged_model")
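
For deployment it's usually worth saving the tokenizer into the same directory so the merged model is self-contained:

# Ship tokenizer files alongside the merged weights
tokenizer.save_pretrained("./merged_model")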

Inference with Adapters

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load base + adapters
base_model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
model = PeftModel.from_pretrained(base_model, "./lora_adapters")
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Generate
model.eval()
inputs = tokenizer("What is Python?", return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=100)

print(tokenizer.decode(outputs[0]))

Comparison: Full vs LoRA vs QLoRA

| Aspect            | Full fine-tune | LoRA             | QLoRA            |
|-------------------|----------------|------------------|------------------|
| Trainable params  | 100%           | ~0.1-1%          | ~0.1-1%          |
| Memory            | ~4x model size | ~1.2x model size | ~0.5x model size |
| Training speed    | Slow           | Fast             | Fast             |
| Quality           | Best           | Very good        | Good             |
| 7B model (VRAM)   | 28 GB+         | ~16 GB           | ~6 GB            |

Troubleshooting

Out of Memory

Fix:

# Enable gradient checkpointing (trades extra compute for less activation memory)
model.gradient_checkpointing_enable()

# Shrink the per-device batch and compensate with accumulation (set in SFTConfig)
per_device_train_batch_size=1
gradient_accumulation_steps=8  # effective batch size stays 1 * 8 = 8
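
If memory is still tight, these knobs combine in one SFTConfig together with an 8-bit paged optimizer; a sketch with typical starting points, not tuned values:

from trl import SFTConfig

sft_config = SFTConfig(
    output_dir="./lora_checkpoints",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,   # effective batch size of 8
    gradient_checkpointing=True,     # recompute activations instead of storing them
    optim="paged_adamw_8bit",        # 8-bit, paged optimizer states
    max_seq_length=512,              # shorter sequences also cut activation memory
)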

Poor Quality

Fix:

  • Increase r (rank)
  • Add more target modules
  • Train longer
  • Check data quality

NaN Loss

Fix:

  • Lower learning rate
  • Use gradient clipping
  • Check for data issues
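
The first two fixes are single arguments in the training config; max_grad_norm is the standard Trainer clipping knob (1.0 is the default, and 0.3 is a tighter value in line with the QLoRA paper's settings):

sft_config = SFTConfig(
    output_dir="./lora_checkpoints",
    learning_rate=5e-5,   # more conservative than the usual 2e-4 LoRA starting point
    max_grad_norm=0.3,    # clip gradient norm tighter than the 1.0 default
)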

When to Use This Skill

Use when:

  • GPU memory is limited
  • Fine-tuning large models (7B+)
  • Need fast training iterations
  • Want to swap adapters for different tasks

Cross-References

  • bazzite-ai-jupyter:finetuning - Full fine-tuning basics
  • bazzite-ai-jupyter:quantization - Quantization for QLoRA
  • bazzite-ai-jupyter:transformers - Target module selection