| name | finetune-train |
| description | Use when training a fine-tuned model and evaluating improvement over the base model. Triggers: you have filtered training data, are ready to submit a training job, or need to convert to GGUF. Requires finetune-generate first. |
# Fine-tune Train
Train the model and verify improvement over base model.
## Prerequisites

Complete finetune-generate first. You need:

- `training_data.jsonl` — filtered, validated training examples
- Model choice from finetune-design
- Evaluation rubric (for comparing base vs fine-tuned)
## Scope: SFT Only

This skill covers Supervised Fine-Tuning (SFT) only. Other training methods (DPO, GRPO, GEPA) may be added in the future.
## Related HuggingFace Skills

Use these HuggingFace skills for an integrated workflow:

| Skill | When to Use |
|---|---|
| `model-trainer` | Primary training skill — Trackio integration, TRL patterns, HF Jobs submission |
| `hugging-face-dataset-creator` | Push/manage datasets on HF Hub |
| `hugging-face-evaluation-manager` | Add eval results to model cards after training |
## Outputs

By the end of this phase, you will have:

- Fine-tuned model (adapter on HuggingFace Hub)
- Merged GGUF file for local deployment
- GGUF uploaded to HuggingFace Hub (for others to download)
- `evaluation_report.md` — statistical comparison of base vs fine-tuned
## Workflow

### Step 1: Dataset Preparation
Format and upload your training data.

Verify the format matches training framework expectations:

```json
{"messages": [
  {"role": "system", "content": "..."},
  {"role": "user", "content": "..."},
  {"role": "assistant", "content": "..."}
]}
```

Push to HuggingFace Hub using the `datasets` library:

```python
from datasets import Dataset
import json

# Load your JSONL
examples = [json.loads(line) for line in open("training_data.jsonl")]

# Keep only the messages field for training
training_data = [{"messages": ex["messages"]} for ex in examples]

# Push to Hub
dataset = Dataset.from_list(training_data)
dataset.push_to_hub("username/dataset-name", private=True)
```

Verify access — test loading the dataset before submitting the training job:

```python
from datasets import load_dataset

ds = load_dataset("username/dataset-name", split="train")
print(f"Loaded {len(ds)} examples")
```
Optional: Use the `hugging-face-dataset-creator` skill for streamlined HF Hub dataset management.

Gate: Dataset uploaded and accessible
### Step 2: Choose Training Approach
| Approach | Best For |
|---|---|
| HuggingFace Jobs | Fast iteration, serverless GPU, minimal setup |
| MLX Local | Apple Silicon, no cloud dependency |
| Cloud GPU | Full control, large jobs |
HuggingFace Jobs is recommended for most projects.
Also consider Thinking Machines.
Reference: training-guide.md
### Step 3: Select GPU Based on Context Length

Critical: Vocabulary size dominates memory at long contexts. Logits are computed in FP32 regardless of quantization:

- Formula: `vocab_size × sequence_length × 4 bytes`
| Model | Vocab Size | Logits @ 16k tokens | GPU Required |
|---|---|---|---|
| Gemma 3 12B | 262K | ~17 GB | A100 |
| Qwen3 14B | 152K | ~10 GB | A100 |
| Llama 3 8B | 128K | ~8 GB | A10G or A100 |
GPU Selection Guide:
| Context Length | Gemma 3 (262K vocab) | Qwen3 (152K vocab) | Llama 3 (128K vocab) |
|---|---|---|---|
| 2048 tokens | A10G (24GB) | A10G (24GB) | A10G (24GB) |
| 8192 tokens | A100 (80GB) | A10G (24GB) | A10G (24GB) |
| 16384 tokens | A100 (80GB) | A100 (80GB) | A100 (80GB) |
Rule of thumb: If your conversations average 8k+ tokens and you're using Gemma 3, use A100.
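To sanity-check these numbers yourself, here is a quick sketch using the FP32 logits formula above. The vocabulary sizes are approximate values matching the table; actual peak training memory also includes weights, activations, and optimizer state, so treat this as a lower bound.

```python
# Estimate FP32 logits memory: vocab_size × sequence_length × 4 bytes
VOCAB_SIZES = {  # approximate vocabulary sizes from the table above
    "gemma-3": 262_144,
    "qwen3": 151_936,
    "llama-3": 128_256,
}

def logits_gb(vocab_size: int, seq_len: int) -> float:
    """FP32 logits memory in GB for one sequence."""
    return vocab_size * seq_len * 4 / 1e9

for name, vocab in VOCAB_SIZES.items():
    print(f"{name}: {logits_gb(vocab, 16_384):.1f} GB at 16k tokens")
# gemma-3: 17.2 GB, qwen3: 10.0 GB, llama-3: 8.4 GB — matching the table
```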
### Step 4: Configure Training

QLoRA parameters (typical):

```python
from peft import LoraConfig

peft_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```
Training hyperparameters with Trackio:

```python
import torch
from trl import SFTConfig

config = SFTConfig(
    output_dir="model-name",
    push_to_hub=True,
    hub_model_id="username/model-name",
    hub_strategy="every_save",  # Push checkpoints
    # Quantization
    model_init_kwargs={
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",
        "bnb_4bit_compute_dtype": torch.bfloat16,
        "bnb_4bit_use_double_quant": True,
        "device_map": "auto",
    },
    # Training
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    max_length=16384,  # Adjust based on GPU (see Step 3)
    bf16=True,
    gradient_checkpointing=True,
    optim="adamw_8bit",
    # Logging & checkpointing
    logging_steps=10,
    save_strategy="steps",
    save_steps=100,
    save_total_limit=2,
    # Trackio monitoring (RECOMMENDED)
    report_to="trackio",
)
```
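A minimal sketch of wiring these together with TRL's `SFTTrainer`. The base model ID is a placeholder — use your choice from finetune-design — and `ds` is the dataset loaded in Step 1.

```python
from trl import SFTTrainer

trainer = SFTTrainer(
    model="Qwen/Qwen3-14B",   # placeholder: your chosen base model
    args=config,              # the SFTConfig above
    train_dataset=ds,         # the dataset loaded in Step 1
    peft_config=peft_config,  # the LoraConfig above
)
trainer.train()
```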
Reference: training-guide.md#configuration
### Step 5: Submit Training

Use the `hf jobs uv run` CLI:

```bash
# CRITICAL: Flags MUST come BEFORE the script path
hf jobs uv run \
  --flavor a100-large \
  --timeout 6h \
  --secrets HF_TOKEN \
  scripts/train_model.py
```
Common mistakes:

```bash
# ❌ WRONG: flags after script (will be ignored!)
hf jobs uv run train.py --flavor a100-large

# ❌ WRONG: --secret (singular) instead of --secrets (plural)
hf jobs uv run --secret HF_TOKEN train.py

# ✅ CORRECT: flags before script
hf jobs uv run --flavor a100-large --secrets HF_TOKEN train.py
```
Monitor training:

```bash
hf jobs ps                # List jobs
hf jobs logs <job_id>     # View logs
hf jobs inspect <job_id>  # Job details
```
Trackio dashboard: https://huggingface.co/spaces/username/trackio
Reference: training-guide.md#hf-jobs
### Step 6: GGUF Conversion & Upload

Convert the fine-tuned adapter to GGUF and optionally upload to the Hub.

Memory requirements: Merging a 14B model requires ~28 GB RAM (14B parameters × 2 bytes/param in bfloat16).
Manual Workflow (Recommended for macOS):
# 1. Clone llama.cpp (NOT Homebrew - version mismatch issues)
git clone https://github.com/ggerganov/llama.cpp ~/llama.cpp
pip install -r ~/llama.cpp/requirements.txt
# 2. Download base model (resumable)
hf download Qwen/Qwen3-14B --local-dir ~/models/qwen3-14b-base
# 3. Download adapter
hf download username/my-finetuned-qwen3-14b --local-dir ./models/qwen3-finetuned/adapter
# 4. Merge adapter (bfloat16, ~28GB RAM)
uv run python scripts/merge_lora_adapter.py \
--base-model ~/models/qwen3-14b-base \
--adapter-path ./models/qwen3-finetuned/adapter \
--output-dir ./models/qwen3-finetuned/merged
# 5. Convert to GGUF
uv run python ~/llama.cpp/convert_hf_to_gguf.py \
--outtype bf16 \
--outfile ./models/qwen3-finetuned/model-bf16.gguf \
./models/qwen3-finetuned/merged
# 6. Quantize (Homebrew llama-quantize works)
llama-quantize \
./models/qwen3-finetuned/model-bf16.gguf \
./models/qwen3-finetuned/model-q4_k_m.gguf \
Q4_K_M
Automated Script (Alternative):

```bash
uv run python scripts/convert_to_gguf.py \
  --adapter-repo username/my-finetuned-qwen3-14b \
  --base-model Qwen/Qwen3-14B \
  --output-dir ./models/qwen3-finetuned \
  --upload
```
Test locally:

```bash
llama-server -m ./models/qwen3-finetuned/model-q4_k_m.gguf --port 8080 -ngl 99
```
Reference: training-guide.md#gguf-conversion
### Step 7: Evaluation
Compare fine-tuned model against base model using pairwise full-conversation assessment.
Evaluation methodology:
- Generate NEW personas using seeds not in training data (e.g., seeds 9000+)
- Run BOTH models on the same personas with the same user simulator
- Assess all conversations with your rubric (same assessor, same criteria)
- Compare scores pairwise using statistical test (paired t-test)
Why this methodology:
- New seeds: Prevents evaluation on training distribution — tests generalization
- Pairwise comparison: Same persona + same user simulator = controlled comparison
- Full conversations: Tests multi-turn dynamics, not just single-turn quality
Multi-Model Comparison Workflow:

```bash
# Step 1: Generate evaluation personas (uses seeds 9000+, not in training)
uv run python scripts/generate_eval_personas.py --count 15
# Output: data/eval/personas.json

# Step 2: Start model servers on different ports
# Terminal 1: Baseline
llama-server -m gemma-3-12b-it.gguf --port 8080 -ngl 99
# Terminal 2: Fine-tuned Gemma
llama-server -m finetuned-gemma.gguf --port 8081 -ngl 99
# Terminal 3: Fine-tuned Qwen (if comparing multiple)
llama-server -m finetuned-qwen.gguf --port 8082 -ngl 99

# Step 3: Run evaluation
uv run python scripts/run_model_evaluation.py \
  --personas data/eval/personas.json \
  --output-dir data/eval/results

# Step 4: Review report at data/eval/results/evaluation_report.md
```
Evaluation scripts:

| Script | Purpose |
|---|---|
| `scripts/generate_eval_personas.py` | Generate NEW personas for evaluation |
| `scripts/run_model_evaluation.py` | Run multi-model comparison with statistical tests |
| `scripts/merge_lora_adapter.py` | Merge LoRA adapter with base model (bfloat16) |
| `scripts/convert_to_gguf.py` | End-to-end: download, merge, convert, upload |
Success criteria:
- Improvement: ≥10% absolute improvement over baseline
- Significance: p < 0.05 (paired t-test)
- Safety: No regressions on safety criteria
Statistical comparison:

```python
from scipy import stats
import numpy as np

# Paired t-test (same personas, same user simulator)
t_stat, p_value = stats.ttest_rel(finetuned_scores, base_scores)
improvement = np.mean(finetuned_scores) - np.mean(base_scores)
```
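To tie this back to the success criteria, a hypothetical ship/iterate gate (this assumes rubric scores on a 0–1 scale, so a 0.10 difference is a 10% absolute improvement):

```python
# Hypothetical gate using the success criteria above
ship = improvement >= 0.10 and p_value < 0.05
print(f"improvement={improvement:+.3f}, p={p_value:.4f} -> {'ship' if ship else 'iterate'}")
```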
Reference: training-guide.md#evaluation
Optional: Use the `hugging-face-evaluation-manager` skill to add evaluation results to your model card on HF Hub.
### Step 8: Sanity Checks
Before declaring success, verify:
| Check | Purpose |
|---|---|
| Perplexity on held-out set (sketch below) | Did training actually work? |
| Small human eval (5-10 convos) | Does LLM-as-judge agree with humans? |
| Capability regression test | Didn't break general abilities |
| Safety regression | No new harmful patterns |
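A minimal sketch for the perplexity check, assuming a held-out file in the same `messages` JSONL format as the training data. The file name and model ID are placeholders; run the same script against the base model on the same file to compare.

```python
import json
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "username/model-name"  # placeholder: your merged fine-tuned model
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

losses = []
for line in open("heldout.jsonl"):  # placeholder: held-out examples
    messages = json.loads(line)["messages"]
    ids = tok.apply_chat_template(messages, return_tensors="pt").to(model.device)
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # loss over all tokens in the conversation
    losses.append(loss.item())

# Lower is better; a fine-tuned model should beat the base on held-out data
print(f"Perplexity: {math.exp(sum(losses) / len(losses)):.2f}")
```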
Warning signs:
- Fine-tuned worse than base → Training issue, check data quality
- Huge improvement (>30%) → Suspiciously high, verify evaluation
- Safety regressions → Do not deploy
## Decision: Ship or Iterate?
| Result | Action |
|---|---|
| ≥10% improvement, p<0.05, no regressions | Ship it |
| <10% improvement | Consider more/better training data |
| Not significant (p>0.05) | More evaluation data or training data |
| Safety regressions | Do not deploy, investigate |
## Red Flags: Rationalizations to Resist
| Rationalization | Reality |
|---|---|
| "Perplexity improved, we're done" | Low perplexity ≠ good conversations. Full-conversation eval required. |
| "It feels better, ship it" | Feelings aren't evidence. Run statistical comparison (p<0.05). |
| "Default hyperparameters are fine" | Large-vocab models (Gemma 3) OOM with defaults. Check max_length and GPU. |
| "Skip GGUF, we'll deploy later" | GGUF conversion is the deployment. Test locally before declaring success. |
| "Safety check is paranoid" | Fine-tuning can introduce regressions. Safety audit is mandatory. |
| "A10G is fine for everything" | Gemma 3 with 16k context needs A100 due to 262K vocabulary. |
## Model Naming Conventions

Different model families use different naming conventions for instruction-tuned variants:

| Family | Base Model | Instruction-Tuned |
|---|---|---|
| Gemma 3 | `google/gemma-3-12b` | `google/gemma-3-12b-it` (add `-it`) |
| Qwen3 | `Qwen/Qwen3-14B-Base` | `Qwen/Qwen3-14B` (base has `-Base` suffix) |
| Llama 3 | `meta-llama/Llama-3-8B` | `meta-llama/Llama-3-8B-Instruct` |
For SFT fine-tuning, always use the instruction-tuned variant to preserve instruction-following capabilities.
## Done When

- Training completed successfully
- GGUF converted and tested locally
- GGUF uploaded to HuggingFace Hub
- Evaluation shows significant improvement (≥10%, p<0.05)
- No safety regressions
- `evaluation_report.md` documents results
## Resources
| Resource | What It Contains |
|---|---|
| training-guide.md | Complete training guide with Trackio, CLI syntax, troubleshooting |
| code/SETUP-REFERENCE.md | Project structure, script templates |
| code/infrastructure.py | Copy-paste ready: token counting, slice generation |
| examples/therapy-domain.md | Complete therapy example: evaluation results, model choice |
HuggingFace Hub integration (skills):

| Skill | Use For |
|---|---|
| `model-trainer` | Primary HF training skill with Trackio, TRL patterns |
| `hugging-face-dataset-creator` | Push/manage datasets on HF Hub |
| `hugging-face-evaluation-manager` | Add eval results to model cards |
## What's Next?
After successful fine-tuning:
- Deploy model locally (Ollama, llama.cpp)
- Monitor real-world usage for issues
- Collect feedback for future iterations
- Consider DPO/RLHF if further refinement needed