---
name: finetune-design
description: Use when preparing to fine-tune an LLM for multi-turn conversations, before generating any training data. Triggers - starting a fine-tuning project, needing to define evaluation criteria, designing conversation data generation.
---
Fine-tune Design
Design all artifacts needed before generating training data for multi-turn conversation fine-tuning.
Inputs
- Domain to fine-tune for (customer support, coaching, tutoring, etc.)
- Deployment constraints (hardware, offline requirement, budget)
- Access to domain expertise (or ability to research it)
Outputs
By the end of this phase, you will have:
- model-choice.md — Selected model with documented tradeoffs
- config/input-taxonomy.yaml — Topics, styles, difficulty, edge cases
- config/rubric.yaml — Binary criteria with calibration examples
- config/persona-template.yaml — Diversity dimensions and distributions
- config/prompts/user_sim.md — User simulator prompt
- config/prompts/assistant.md — Assistant generation prompt
- config/system-prompt.md — System prompt for training data
- base-model-eval-results.md — Baseline evaluation results
Required Technique: Expert Role-Play Critique
Apply this to EVERY design artifact. Role-play domain experts (real or fictional) to stress-test your designs before committing.
| Apply To | Experts to Consider |
|---|---|
| Taxonomy | Domain practitioners, user researchers, edge case specialists |
| Rubric | Quality experts, safety specialists, methodology creators |
| Personas | User advocates, accessibility experts, diverse user representatives |
| Prompts | Domain practitioners, AI safety researchers, communication experts |
Process:
1. Identify 5-7 relevant experts for your domain
2. Have Claude role-play each expert critiquing your design
3. Ask: "What would pass this but still be inadequate? What user populations does this miss?"
4. Synthesize feedback into improvements
This catches blind spots invisible from your own perspective. One project discovered 6 critical rubric gaps through expert critique that would have corrupted training data.
Full guide: assessment-guide.md#expert-role-play-critique
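To speed up the critique pass, the prompts can be scripted. The sketch below is illustrative only; the expert roster, question wording, and sample taxonomy text are assumptions to adapt per domain.

```python
# Sketch: build one role-play critique prompt per expert for a design artifact.
# The expert roster, question wording, and sample artifact text are
# illustrative assumptions, not part of this skill's required tooling.

CRITIQUE_QUESTIONS = [
    "What would pass this design but still be inadequate?",
    "What user populations or situations does this design miss?",
]

def build_critique_prompts(artifact_name, artifact_text, experts):
    """Return one critique prompt per expert."""
    question_block = "\n".join(f"- {q}" for q in CRITIQUE_QUESTIONS)
    return [
        f"You are {expert}. Critique the following {artifact_name} "
        f"for a fine-tuning project.\n\n{artifact_text}\n\n"
        f"Answer with concrete gaps, not general praise:\n{question_block}"
        for expert in experts
    ]

# Example: experts for a customer-support domain (use 5-7 in a real pass).
prompts = build_critique_prompts(
    "input taxonomy",
    "topics: [billing, setup, bug_report]\nstyles: [terse, casual, formal]",
    ["a senior support agent", "a user researcher", "an accessibility specialist"],
)
print(prompts[0])
```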
Workflow
Step 1: Base Model Selection
Select the model you'll fine-tune based on:
| Factor | Why It Matters |
|---|---|
| Context window | Max conversation length you can train on |
| Quantization support | GGUF, MLX, QAT for local deployment |
| Base capability | Evaluate before committing |
| Training cost | LoRA/QLoRA vs full fine-tune |
| Deployment target | Ollama, llama.cpp, MLX |
Gate: Model chosen with documented tradeoffs in model-choice.md
Reference: model-selection-guide.md
Step 2: Token Economics
Determine training constraints based on cost:
| Tokens/Example | Cost Impact |
|---|---|
| <8K | Cheapest, short conversations only |
| 8-16K | Cost-effective, moderate conversations |
| 16-32K | Expensive, long conversations |
| >32K | Very expensive, may require special handling |
Constraint: Plan max conversation length based on your budget. 16K is a practical ceiling for most projects.
Gate: Max transcript token length defined
Reference: model-selection-guide.md#token-economics
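To make the budgeting arithmetic concrete, here is a minimal sketch; the $3 per million tokens figure is a placeholder assumption, not any provider's actual rate.

```python
# Sketch: rough cost estimate for one pass over a training set.
# The price per 1M tokens is a placeholder assumption; substitute your
# provider's actual rate and your measured tokens per example.

def estimate_cost(num_examples: int, tokens_per_example: int,
                  usd_per_million_tokens: float) -> float:
    """Total cost in USD for generating/processing all examples once."""
    total_tokens = num_examples * tokens_per_example
    return total_tokens / 1_000_000 * usd_per_million_tokens

# 1,000 transcripts capped at 16K tokens each, at an assumed $3 / 1M tokens:
print(f"${estimate_cost(1_000, 16_000, 3.0):,.2f}")  # $48.00
```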
Step 3: Input Taxonomy
Define the distribution of inputs to generate. A good taxonomy has multiple dimensions:
| Dimension | Question | Examples |
|---|---|---|
| WHAT | What are they asking about? | Topics, subtopics |
| HOW | How do they communicate? | Style, verbosity, tone |
| WHO | Who are they? | Demographics, context |
| DIFFICULTY | How hard is this to handle? | Easy, medium, hard |
| EDGE CASES | What should trigger special handling? | Boundaries, safety |
Key lesson: Allocate ~15% to edge cases. Without explicit representation, the model won't learn to handle them.
→ Apply Expert Role-Play: Have domain experts critique your taxonomy for missing topics and user types.
Gate: Weighted taxonomy with cross-product dimensions in config/input-taxonomy.yaml
Reference: taxonomy-guide.md
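A minimal sketch of how a weighted, multi-dimension taxonomy can drive scenario sampling; the dimensions, values, and weights below are illustrative stand-ins for whatever ends up in config/input-taxonomy.yaml.

```python
# Sketch: sample scenario specs from a weighted, multi-dimension taxonomy.
# Dimension names, values, and weights are illustrative; load the real ones
# from config/input-taxonomy.yaml.
import random

TAXONOMY = {
    "topic":      {"billing": 0.30, "setup": 0.30, "bug_report": 0.25, "edge_case": 0.15},
    "style":      {"terse": 0.2, "casual": 0.5, "formal": 0.3},
    "difficulty": {"easy": 0.4, "medium": 0.4, "hard": 0.2},
}

def sample_scenario(rng: random.Random) -> dict:
    """Draw one value per dimension, weighted by the taxonomy."""
    return {
        dim: rng.choices(list(opts), weights=list(opts.values()), k=1)[0]
        for dim, opts in TAXONOMY.items()
    }

rng = random.Random(42)
specs = [sample_scenario(rng) for _ in range(1000)]
edge_share = sum(s["topic"] == "edge_case" for s in specs) / len(specs)
print(f"edge cases: {edge_share:.0%}")  # should land near the ~15% target
```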
Step 4: Evaluation Rubric
Design quality criteria for assessing generated conversations.
Critical requirements:
- Binary judgments (YES/NO/NA) — not numeric scales
- Grouped into weighted categories
- Safety gates that auto-reject on failure
- 3-8 calibration examples per criterion (essential for multi-backend consistency)
Why calibration examples are non-negotiable: During generation, you'll run assessment with multiple LLM backends (Claude, GPT, Gemini) to catch blind spots. Without calibration examples, backends interpret criteria differently — 20-30% disagreement is common. Calibration examples anchor consistent interpretation.
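One way to quantify that drift during a pilot run: compare per-criterion verdicts across backends and compute a disagreement rate. The sketch below uses hard-coded verdicts as stand-ins for real assessment output.

```python
# Sketch: measure cross-backend disagreement for one rubric criterion.
# The verdict lists are hard-coded stand-ins for real assessment runs
# over the same set of transcripts.

def disagreement_rate(verdicts_by_backend: dict[str, list[str]]) -> float:
    """Fraction of transcripts where backends do not all agree."""
    per_transcript = zip(*verdicts_by_backend.values())
    rows = [set(v) for v in per_transcript]
    return sum(len(r) > 1 for r in rows) / len(rows)

cq1 = {
    "backend_a": ["YES", "YES", "NO", "YES", "NO"],
    "backend_b": ["YES", "NO",  "NO", "YES", "YES"],
    "backend_c": ["YES", "YES", "NO", "YES", "NO"],
}
print(f"CQ1 disagreement: {disagreement_rate(cq1):.0%}")  # 40% in this toy data
```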
Structure:
```yaml
categories:
  comprehension:
    weight: 0.15
    criteria: [CQ1, CQ2]
  # ... more categories

criteria:
  CQ1:
    name: "Accurate understanding"
    question: "Does the response demonstrate accurate understanding?"
    na_valid: false  # Must always be assessable
    calibration_examples:
      - type: PASS
        context: "..."
        response: "..."
        reasoning: "..."
      - type: FAIL
        # ...

safety_gates: [CQ8, CQ9]  # Any failure = auto-reject
pass_threshold: 0.80
```
→ Apply Expert Role-Play: Have quality experts critique your criteria for blind spots and edge cases.
Gate: Rubric with calibration examples in config/rubric.yaml
Reference: rubric-guide.md
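To make the scoring semantics concrete, the sketch below shows one way binary judgments, category weights, safety gates, and the pass threshold can combine. It is illustrative only; the copy-paste-ready scorer lives in code/infrastructure.py, and the extra category names and criterion IDs here are assumptions.

```python
# Sketch: combine binary criterion judgments into a weighted pass/fail decision.
# Category names, weights, and criterion IDs beyond the YAML example above are
# assumptions; the real scorer ships in code/infrastructure.py.

CATEGORIES = {                      # category -> (weight, criteria)
    "comprehension": (0.15, ["CQ1", "CQ2"]),
    "safety":        (0.25, ["CQ8", "CQ9"]),
    "quality":       (0.60, ["CQ3", "CQ4", "CQ5"]),
}
SAFETY_GATES = ["CQ8", "CQ9"]
PASS_THRESHOLD = 0.80

def score(judgments: dict[str, str]) -> tuple[float, bool]:
    """judgments maps criterion id -> 'YES' | 'NO' | 'NA'."""
    if any(judgments.get(c) == "NO" for c in SAFETY_GATES):
        return 0.0, False                      # safety gate failure = auto-reject
    total = 0.0
    for weight, criteria in CATEGORIES.values():
        scored = [c for c in criteria if judgments.get(c) != "NA"]
        if not scored:
            continue                           # fully-NA category skipped (a simplification)
        passed = sum(judgments.get(c) == "YES" for c in scored)
        total += weight * passed / len(scored)
    return total, total >= PASS_THRESHOLD

good = {"CQ1": "YES", "CQ2": "YES", "CQ3": "YES", "CQ4": "YES",
        "CQ5": "YES", "CQ8": "YES", "CQ9": "YES"}
unsafe = dict(good, CQ9="NO")
print(score(good))    # (1.0, True)
print(score(unsafe))  # (0.0, False) -- safety gate auto-reject
```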
Step 5: Persona Template
Design user diversity for realistic training data.
Dimensions to define:
- Communication style (terse, verbose, emotional, analytical)
- Behavior patterns / "flaws" (resistance, deflection, etc.)
- Domain-specific attributes (varies by domain)
Key lesson: Flaws vary per message, not per conversation. Real people have good days and bad days.
→ Apply Expert Role-Play: Have user advocates critique your personas for missing populations and unrealistic patterns.
```yaml
persona_template:
  communication_style:
    options: [terse, casual, formal, stream-of-consciousness]
    weights: [0.15, 0.50, 0.25, 0.10]
  flaw_patterns:
    primary:     # 50% chance per message
    secondary:   # 20% chance each per message
    # 20% of personas should have NO flaw patterns
```
Gate: Persona template with distributions in config/persona-template.yaml
Reference: persona-guide.md
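Because "flaws vary per message" is easy to get wrong, here is a minimal sketch of per-message flaw rolls; the probabilities mirror the template above and the persona fields are illustrative assumptions.

```python
# Sketch: roll persona flaws per MESSAGE, not per conversation.
# Probabilities mirror the template above and are starting points;
# the persona field names are illustrative.
import random

def message_flaws(persona: dict, rng: random.Random) -> list[str]:
    """Return the flaw patterns active for one message of this persona."""
    active = []
    if persona.get("primary_flaw") and rng.random() < 0.50:
        active.append(persona["primary_flaw"])
    for flaw in persona.get("secondary_flaws", []):
        if rng.random() < 0.20:
            active.append(flaw)
    return active

# ~20% of personas should carry no flaw patterns at all (handled at persona creation).
rng = random.Random(7)
persona = {"primary_flaw": "deflection", "secondary_flaws": ["resistance"]}
for turn in range(4):                       # same persona, fresh rolls every message
    print(turn, message_flaws(persona, rng))
```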
Step 6: Prompts
Create the three prompts for data generation:
| Prompt | Purpose |
|---|---|
| User simulator | Generate realistic user messages with flaws |
| Assistant | Generate high-quality responses |
| System prompt | What gets baked into training data |
Key lessons for assistant prompt:
- Length matching: Target 1.0-1.5x user word count, hard limit 2x
- Tentative language for interpretations ("I wonder if..." not "You are...")
- Question discipline: At most 1-2 questions per response
- Anti-patterns list: Specific phrases to avoid
→ Apply Expert Role-Play: Have domain experts critique your prompts for missing requirements and problematic patterns.
Gate: All three prompts drafted
Reference: generation-guide.md (in finetune-generate)
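The length-matching and question-discipline rules can also be checked mechanically during generation. The sketch below is illustrative, with its limits taken from the key lessons above.

```python
# Sketch: mechanical checks for the assistant-prompt key lessons above.
# Limits (1.0-1.5x target, 2x hard cap, at most 2 questions) come from this step.

def check_response(user_msg: str, assistant_msg: str) -> list[str]:
    """Return a list of violated constraints (empty list = OK)."""
    problems = []
    user_words = max(len(user_msg.split()), 1)
    ratio = len(assistant_msg.split()) / user_words
    if ratio > 2.0:
        problems.append(f"length ratio {ratio:.1f}x exceeds the 2x hard limit")
    elif ratio > 1.5:
        problems.append(f"length ratio {ratio:.1f}x above the 1.0-1.5x target")
    if assistant_msg.count("?") > 2:
        problems.append("more than 2 questions in one response")
    return problems

print(check_response(
    "My plan renewed twice this month and I was charged both times.",
    "Thanks for flagging that. I wonder if the second charge is a duplicate "
    "renewal? I can check the billing log. Could you share the last charge date?",
))  # flags the 2x hard-limit violation
```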
Step 7: Base Model Evaluation
Before committing to fine-tune, evaluate the base model on your rubric.
Process:
1. Generate 10-20 test scenarios covering your taxonomy
2. Have the base model respond to each
3. Assess with your rubric
4. Calculate the pass rate
Decision gate:
| Pass Rate | Recommendation |
|---|---|
| >70% | Base model may be sufficient. Consider prompt engineering first. |
| 50-70% | Fine-tuning likely helpful. Moderate improvement expected. |
| <50% | Fine-tuning needed. Significant improvement expected. |
Gate: Base model evaluated, decision to proceed documented in base-model-eval-results.md
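A minimal sketch of the decision arithmetic: compute the pass rate over the assessed scenarios and map it onto the gate above. The example results are placeholders.

```python
# Sketch: turn per-scenario rubric results into the decision gate above.

def recommend(passes: list[bool]) -> str:
    """Map base-model pass rate onto the fine-tuning decision gate."""
    rate = sum(passes) / len(passes)
    if rate > 0.70:
        decision = "base model may suffice; try prompt engineering first"
    elif rate >= 0.50:
        decision = "fine-tuning likely helpful"
    else:
        decision = "fine-tuning needed"
    return f"pass rate {rate:.0%}: {decision}"

# 20 test scenarios drawn from the taxonomy, assessed with the rubric (placeholder data):
results = [True] * 9 + [False] * 11
print(recommend(results))  # pass rate 45%: fine-tuning needed
```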
A Note on Numbers
All numeric parameters in these guides (15% edge cases, 50%/20% flaw probabilities, 0.80 pass threshold, etc.) are starting points from one successful project, not universal truths. Calibrate them for your domain based on pilot generation results and human review.
Red Flags: Rationalizations to Resist
| Rationalization | Reality |
|---|---|
| "Base model is obviously not good enough" | Evaluate anyway. You need baseline numbers for comparison. |
| "I'll use numeric scales (1-5), it's fine" | Numeric scales drift across assessors. Binary judgments are consistent. |
| "Calibration examples are overkill" | Without examples, backends interpret criteria differently. 20-30% disagreement. |
| "Edge cases are rare, skip them" | Without ~15% edge case representation, model fails at boundaries. |
| "I know what users want, skip taxonomy" | Your intuition is biased. Formal taxonomy ensures coverage. |
| "Expert role-play takes too long" | 1 hour of critique catches blind spots that corrupt 100+ transcripts. Do it. |
Done When
- All 8 output files created
- Expert role-play critique applied to taxonomy, rubric, personas, and prompts
- Base model evaluated against rubric
- Decision to proceed with fine-tuning documented
- Ready to start finetune-generate phase
Resources
| Resource | What It Contains |
|---|---|
| code/SETUP-REFERENCE.md | Project structure and file templates |
| code/infrastructure.py | Copy-paste ready: LLM backend, checkpointing, slicing, scoring |
| examples/therapy-domain.md | Complete therapy domain example: taxonomy, flaws, rubric criteria |