| name | sft |
| description | Supervised Fine-Tuning with SFTTrainer and Unsloth. Covers dataset preparation, chat template formatting, training configuration, and Unsloth optimizations for 2x faster instruction tuning. Includes thinking model patterns. |
Supervised Fine-Tuning (SFT)
Overview
SFT adapts a pre-trained LLM to follow instructions by training it on instruction-response pairs. Unsloth patches TRL's SFTTrainer for 2x faster training with reduced memory usage. This skill also includes patterns for training thinking/reasoning models.
Quick Reference
| Component | Purpose |
|---|---|
| FastLanguageModel | Load model with Unsloth optimizations |
| SFTTrainer | Trainer for instruction tuning |
| SFTConfig | Training hyperparameters |
| dataset_text_field | Column containing formatted text |
| Token ID 151668 | </think> boundary for Qwen3-Thinking models |
Critical Environment Setup
import os
from dotenv import load_dotenv
load_dotenv()
# Force text-based progress in Jupyter
os.environ["TQDM_NOTEBOOK"] = "false"
Critical Import Order
# CRITICAL: Import unsloth FIRST for proper TRL patching
import unsloth
from unsloth import FastLanguageModel, is_bf16_supported
# Then other imports
from trl import SFTTrainer, SFTConfig
from datasets import Dataset
import torch
Warning: Importing TRL before Unsloth will disable optimizations and may cause errors.
Dataset Formats
Instruction-Response Format
dataset = [
    {"instruction": "What is Python?", "response": "A programming language."},
    {"instruction": "Explain ML.", "response": "Machine learning is..."},
]
Chat/Conversation Format
dataset = [
    {"messages": [
        {"role": "user", "content": "What is Python?"},
        {"role": "assistant", "content": "A programming language."}
    ]},
]
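The two formats carry the same information, so instruction-response records can be converted to the messages layout before a chat template is applied. A minimal sketch, assuming the `instruction` and `response` column names used above; `to_messages` is a hypothetical helper, not part of TRL or Unsloth:

```python
def to_messages(sample):
    """Convert one instruction-response record to the chat/conversation format."""
    return {
        "messages": [
            {"role": "user", "content": sample["instruction"]},
            {"role": "assistant", "content": sample["response"]},
        ]
    }


records = [
    {"instruction": "What is Python?", "response": "A programming language."},
]
chat_records = [to_messages(r) for r in records]
```

The helper works on plain dictionaries, so it can also be passed to `Dataset.map` once the records are wrapped in a Hugging Face `Dataset`.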
Using Chat Templates
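def format_conversation(sample):
    """Render one instruction-response pair with the model's chat template."""
    messages = [
        {"role": "user", "content": sample["instruction"]},
        {"role": "assistant", "content": sample["response"]},
    ]
    return {"text": tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=False
    )}

# `dataset` above is a plain list, so wrap it in a Dataset before calling map()
dataset = Dataset.from_list(dataset).map(format_conversation)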
Thinking Model Format
For models like Qwen3-Thinking, include <think> tags in the assistant response. Write the thinking content in a self-questioning, internal-dialogue style:
def format_thinking_conversation(sample):
    """Format with thinking/reasoning tags."""
    # Combine thinking and response with tags
    assistant_content = f"<think>\n{sample['thinking']}\n</think>\n\n{sample['response']}"
    messages = [
        {"role": "user", "content": sample["instruction"]},
        {"role": "assistant", "content": assistant_content},
    ]
    return {"text": tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=False
    )}
# Sample dataset with self-questioning thinking style
thinking_data = [
    {
        "instruction": "What is machine learning?",
        "thinking": "What is the user asking here? They want to understand machine learning. What are the key concepts I should cover? It's a subset of AI... and it involves learning from data. How should I keep this accessible? Short and clear definition.",
        "response": "Machine learning is a subset of artificial intelligence where computers learn patterns from data."
    },
    {
        "instruction": "Explain Python in one sentence.",
        "thinking": "One sentence only - what's most important about Python? Its readability and versatility are the defining features. How do I capture both in one sentence?",
        "response": "Python is a high-level programming language known for its readability and versatility."
    },
    {
        "instruction": "What is a neural network?",
        "thinking": "How do I explain neural networks simply? What's the core concept? They're inspired by biological neurons... they process information in layers. Should I mention deep learning? Maybe keep it basic for now.",
        "response": "A neural network is a computational model inspired by biological neurons that processes information through connected layers."
    },
]
dataset = Dataset.from_list(thinking_data)
dataset = dataset.map(format_thinking_conversation, remove_columns=["instruction", "thinking", "response"])
Thinking Style Patterns:
- "What is the user asking here?"
- "Let me think about the key concepts..."
- "How should I structure this explanation?"
- "What's most important about X?"
Unsloth SFT Setup
Load Model
from unsloth import FastLanguageModel
# Standard model
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Qwen3-4B-unsloth-bnb-4bit",
    max_seq_length=512,
    load_in_4bit=True,
)
# Thinking model (for reasoning tasks)
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Qwen3-4B-Thinking-2507-unsloth-bnb-4bit",
    max_seq_length=1024,  # Increased for thinking content
    load_in_4bit=True,
)
Apply LoRA
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)
Training Configuration
from trl import SFTConfig
sft_config = SFTConfig(
    output_dir="./sft_output",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    max_steps=100,
    learning_rate=2e-4,
    fp16=not is_bf16_supported(),
    bf16=is_bf16_supported(),
    optim="adamw_8bit",
    max_seq_length=512,
)
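The batch settings in this configuration combine into an effective batch size, which is the number of sequences behind each optimizer step. A small sketch of the arithmetic, assuming a single GPU by default; `effective_batch_size` is an illustrative helper, not a TRL function:

```python
def effective_batch_size(per_device_train_batch_size, gradient_accumulation_steps, num_devices=1):
    """Number of sequences contributing to each optimizer step."""
    return per_device_train_batch_size * gradient_accumulation_steps * num_devices


# Values from the configuration above: batch size 2, 4 accumulation steps.
batch = effective_batch_size(2, 4)
sequences_seen = batch * 100  # max_steps=100
print(batch, sequences_seen)  # 8 800
```

Lowering `per_device_train_batch_size` while raising `gradient_accumulation_steps` by the same factor keeps this product constant while reducing peak memory.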
SFTTrainer Usage
Basic Training
from trl import SFTTrainer
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    args=sft_config,
)
trainer.train()
With Custom Formatting
def formatting_func(examples):
    """Build one prompt string per (instruction, response) pair in the batch."""
    texts = []
    for instruction, response in zip(examples["instruction"], examples["response"]):
        text = f"### Instruction:\n{instruction}\n\n### Response:\n{response}"
        texts.append(text)
    return texts

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    formatting_func=formatting_func,
    args=sft_config,
)
Key Parameters
| Parameter | Typical Values | Effect |
|---|---|---|
| learning_rate | 2e-5 to 2e-4 | Training speed vs stability |
| per_device_train_batch_size | 1-4 | Memory usage |
| gradient_accumulation_steps | 2-8 | Effective batch size |
| max_seq_length | 512-2048 | Context window |
| optim | "adamw_8bit" | Memory-efficient optimizer |
Save and Load
Save Model
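# Save LoRA adapters only (small); save the tokenizer alongside so the folder loads on its own
model.save_pretrained("./sft_lora")
tokenizer.save_pretrained("./sft_lora")
# Save merged model (full size)
model.save_pretrained_merged("./sft_merged", tokenizer)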
Load for Inference
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained("./sft_lora")
FastLanguageModel.for_inference(model)
Thinking Model Inference
Parse thinking content from model output using token IDs:
THINK_END_TOKEN_ID = 151668 # </think> token for Qwen3-Thinking
def generate_with_thinking(model, tokenizer, prompt):
    """Generate and parse thinking model output."""
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    # Set up pad token if needed
    if tokenizer.pad_token_id is None:
        tokenizer.pad_token_id = tokenizer.eos_token_id
    outputs = model.generate(
        input_ids=inputs,
        max_new_tokens=1024,
        temperature=0.6,
        top_p=0.95,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id,
    )
    # Extract only generated tokens
    input_length = inputs.shape[1]
    generated_ids = outputs[0][input_length:].tolist()
    # Parse thinking and response
    if THINK_END_TOKEN_ID in generated_ids:
        end_idx = generated_ids.index(THINK_END_TOKEN_ID)
        thinking = tokenizer.decode(generated_ids[:end_idx], skip_special_tokens=True)
        response = tokenizer.decode(generated_ids[end_idx + 1:], skip_special_tokens=True)
    else:
        thinking = tokenizer.decode(generated_ids, skip_special_tokens=True)
        response = "(incomplete - increase max_new_tokens)"
    return thinking.strip(), response.strip()
# Usage
FastLanguageModel.for_inference(model)
thinking, response = generate_with_thinking(model, tokenizer, "What is 15 + 27?")
print(f"Thinking: {thinking}")
print(f"Response: {response}")
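When only decoded text is available (for example, output captured from another tool), the same split can be done on the literal `</think>` tag instead of the token ID. A minimal sketch, assuming the decoded text still contains the tag and holds at most one thinking block; `split_thinking` is a hypothetical helper:

```python
def split_thinking(text, end_tag="</think>", start_tag="<think>"):
    """Split decoded model output into (thinking, response) on the closing tag."""
    if end_tag not in text:
        # No closing tag: generation was likely cut off mid-thought.
        return text.replace(start_tag, "").strip(), ""
    thinking, response = text.split(end_tag, 1)
    return thinking.replace(start_tag, "").strip(), response.strip()


thinking, response = split_thinking("<think>\n15 + 27 = 42\n</think>\n\n42")
```

The token-ID approach is more robust when token IDs are available, because it does not depend on how the tokenizer renders the tag during decoding.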
Ollama Integration
Export to GGUF
# Export to GGUF for Ollama
model.save_pretrained_gguf(
    "model",
    tokenizer,
    quantization_method="q4_k_m",
)
Deploy to Ollama
ollama create mymodel -f Modelfile
ollama run mymodel
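`ollama create` reads a Modelfile, which the commands above assume already exists. A minimal sketch that writes one pointing at the exported GGUF file; `write_modelfile` is a hypothetical helper, and the GGUF path is a placeholder, since the exact filename produced by the export is not stated here:

```python
from pathlib import Path


def write_modelfile(gguf_path, modelfile_path="Modelfile"):
    """Write a minimal Ollama Modelfile whose FROM line points at a local GGUF file."""
    content = f"FROM {gguf_path}\n"
    Path(modelfile_path).write_text(content, encoding="utf-8")
    return content


# Placeholder path: check the export directory for the actual .gguf filename.
modelfile = write_modelfile("./model/model.gguf")
```

A real deployment would usually add a chat template and sampling parameters to the Modelfile; this sketch covers only the `FROM` line that `ollama create` needs to locate the weights.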
Troubleshooting
Out of Memory
Symptom: CUDA out of memory error
Fix:
- Use use_gradient_checkpointing="unsloth"
- Reduce per_device_train_batch_size to 1
- Use 4-bit quantization (load_in_4bit=True)
NaN Loss
Symptom: Loss becomes NaN during training
Fix:
- Lower learning_rate to 1e-5
- Check data quality (no empty samples)
- Use gradient clipping
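The data-quality check can be automated before training. A minimal sketch, assuming the `instruction` and `response` column names from the dataset examples; `is_valid_sample` is a hypothetical helper. For gradient clipping, `max_grad_norm` is the standard `TrainingArguments` field that `SFTConfig` inherits.

```python
def is_valid_sample(sample, fields=("instruction", "response")):
    """Return True only if every required field is a non-empty string."""
    return all(
        isinstance(sample.get(field), str) and bool(sample[field].strip())
        for field in fields
    )


samples = [
    {"instruction": "What is Python?", "response": "A programming language."},
    {"instruction": "Explain ML.", "response": "   "},   # whitespace only
    {"instruction": "", "response": "Orphan answer."},   # empty prompt
]
clean = [s for s in samples if is_valid_sample(s)]
print(len(clean))  # 1
```

With a Hugging Face `Dataset`, the same predicate can be passed to `dataset.filter`.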
Slow Training
Symptom: Training slower than expected
Fix:
- Ensure Unsloth is imported FIRST (before TRL)
- Use bf16=True if supported
- Enable use_gradient_checkpointing="unsloth"
Kernel Shutdown (Jupyter)
SFT training uses significant GPU memory. Shut down the kernel to release it:
import IPython
print("Shutting down kernel to release GPU memory...")
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)
Important: Always run this at the end of training notebooks before switching to different models.
When to Use This Skill
Use when:
- Creating instruction-following models
- Fine-tuning for chat/conversation
- Adapting to domain-specific tasks
- Building custom assistants
- First step before preference optimization (DPO/GRPO)
Cross-References
- bazzite-ai-jupyter:peft - LoRA configuration details
- bazzite-ai-jupyter:qlora - Advanced QLoRA experiments (alpha, rank, modules)
- bazzite-ai-jupyter:finetuning - General fine-tuning concepts
- bazzite-ai-jupyter:dpo - Direct Preference Optimization after SFT
- bazzite-ai-jupyter:grpo - GRPO reinforcement learning after SFT
- bazzite-ai-jupyter:inference - Fast inference with vLLM
- bazzite-ai-jupyter:vision - Vision model fine-tuning
- bazzite-ai-ollama:api - Ollama deployment