---
name: finetune-prep
description: This skill should be used when the user wants to "prepare for fine-tuning", "create synthetic training data", "build evaluation rubric", "extract domain knowledge with LLM", "design data generation pipeline", or needs to establish quality standards before training an LLM on a specialized domain.
---
# Fine-Tuning Preparation with LLM-Based Domain Extraction

## Overview
This skill guides the complete preparation process for fine-tuning LLMs on specialized domains using LLM-as-expert for domain knowledge extraction, synthetic data generation, and evaluation design.
Core principle: Quality training data requires explicit domain knowledge, behavioral anchors, and binary evaluation criteria - all created BEFORE generating a single training example.
Key insight: Using LLMs to role-play domain experts works when expertise is encoded into structured prompts with concrete behavioral anchors, failure mode guards, and binary quality gates.
## When to Use
Use this skill when:
- Preparing to fine-tune a model for a specialized domain (therapy, legal, customer support, etc.)
- Building a synthetic data pipeline from LLM-generated expertise
- Need to define "good" before generating training data
- Working with 7B-70B parameter models where base performance is poor
- Have no direct access to domain experts but can research frameworks
Do NOT use for:
- Simple prompt engineering (no fine-tuning needed)
- Domains where you have abundant real training data
- Tasks where base model already performs well
- Quick prototyping without quality gates
## The Process

```
Phase 1: Domain Knowledge Extraction
        ↓
Phase 2: Behavioral Anchor Definition
        ↓
Phase 3: Binary Evaluation Rubric
        ↓
Phase 4: Evaluation Prompt Design
        ↓
Phase 5: Generation Prompt Design
```

Critical order: Evaluation criteria MUST be defined before generation prompts. You cannot improve what you cannot measure.
## Phase 1: Domain Knowledge Extraction
Goal: Research and synthesize domain frameworks into structured reference documentation.
### Steps

1. **Identify relevant frameworks for the domain**
   - Therapy → CBT, DBT, ACT
   - Legal → Legal reasoning frameworks, precedent analysis
   - Customer support → Service recovery, escalation frameworks
   - Code review → Style guides, best practices, security patterns
2. **Research each framework**
   - Use WebSearch for current best practices (include the year in queries, e.g. 2025)
   - Focus on: core principles, specific techniques, conversation patterns
   - Capture what practitioners DO (observable behaviors), not just theory
3. **Search for documented failure modes**
   - AI chatbot failures in this domain (search: "AI [domain] mistakes research")
   - Common human errors in the domain
   - Documented safety issues and ethical violations
4. **Create a framework reference document**
   - Synthesize frameworks with concrete examples
   - Include conversation snippets showing techniques in practice
   - Document failure modes with specific examples
   - Cross-reference techniques (when to use X vs Y)
Output: references/domain-frameworks.md - comprehensive reference covering frameworks, techniques, communication principles, and failure modes.
Quality check:
- Does it include CONCRETE examples (not just abstractions)?
- Are techniques named and explained with conversation examples?
- Are failure modes specific (with real examples)?
- Could you generate 100 conversations from this without running out of patterns?
## Phase 2: Behavioral Anchor Definition
Goal: Transform abstract qualities into observable DO/DON'T behaviors.
Why this matters: "Be empathetic" is not actionable. "Acknowledge emotion before problem-solving" is.
### Steps

1. **Extract ALWAYS DO behaviors**
   - From frameworks: what should EVERY response do?
   - From communication principles: observable actions (reflect back, ask permission, etc.)
   - Make each item behavioral (verb-based, observable)
   - Target: 6-10 specific behaviors
2. **Extract NEVER DO behaviors**
   - From failure modes: what do AI chatbots do wrong?
   - From domain ethics: what crosses boundaries?
   - From safety research: what causes harm?
   - Target: 8-12 specific anti-patterns
3. **Ensure orthogonality**
   - Each behavior checks something distinct
   - No redundancy ("validate feelings" vs "show empathy" → pick one, define it clearly)
   - Behaviors should be independently verifiable
Output: Behavioral anchors section in references/domain-frameworks.md with clear DO/DON'T lists.
Quality check:
- Are behaviors observable (not "be kind" but "avoid dismissive language like 'at least'")?
- Are they actionable (can you write a prompt incorporating them)?
- Do they cover the failure modes you documented?
## Phase 3: Binary Evaluation Rubric
Goal: Create composable binary questions that measure quality without LLM judgment ambiguity.
Core principle: Binary questions (YES/NO) are more reliable than Likert scales (1-5) for LLM-as-judge.
### Steps

1. **Identify evaluation dimensions**
   - Start with user experience goals (feel heard? get value? not annoyed?)
   - Map to observable categories (comprehension, connection, usefulness, fit)
   - Add safety as an automatic-fail gate
   - Target: 4-6 categories
2. **Write binary questions per category**
   - Each question checks ONE specific thing
   - Phrased for presence ("Does it do X?") when possible
   - Absence-based only for safety ("Does it avoid Y?")
   - Target: 3-4 questions per category
3. **Ensure orthogonality**
   - Test: could a response pass Q1 but fail Q2 in the same category?
   - If two questions always correlate, they are redundant → merge or remove
   - Each question should catch a distinct failure mode
4. **Define scoring logic** (see the sketch after these steps)
   - Category score = number of YES answers / number of questions in the category
   - Weighted final score = sum of (category score × weight)
   - Safety = gate (any NO → automatic fail)
   - Pass threshold (typically 0.70-0.80)
5. **Write example evaluations**
   - Strong response (should score high)
   - Mediocre response (should fail)
   - Safety violation (automatic fail)
   - Include reasoning for each question
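To make the scoring logic concrete, here is a minimal Python sketch. The category names, question IDs, weights, and threshold are illustrative placeholders, not values prescribed by this skill.

```python
# Hypothetical rubric layout; adapt categories, question IDs, and weights to your domain.
RUBRIC = {
    "safety":        {"weight": 0.0, "questions": ["S1", "S2", "S3"]},   # gate, not weighted
    "comprehension": {"weight": 0.3, "questions": ["CP1", "CP2", "CP3"]},
    "connection":    {"weight": 0.3, "questions": ["CN1", "CN2", "CN3"]},
    "usefulness":    {"weight": 0.4, "questions": ["U1", "U2", "U3"]},
}
PASS_THRESHOLD = 0.75  # pick something in the 0.70-0.80 range

def score_response(answers: dict[str, bool]) -> tuple[float, bool]:
    """Compute the weighted score and pass/fail verdict for one judged response.

    `answers` maps question IDs to the judge's YES/NO (True/False) verdicts.
    """
    # Safety gate: any NO on a safety question is an automatic fail.
    if not all(answers[q] for q in RUBRIC["safety"]["questions"]):
        return 0.0, False

    total = 0.0
    for category, spec in RUBRIC.items():
        if category == "safety":
            continue  # already handled as a gate
        category_score = sum(answers[q] for q in spec["questions"]) / len(spec["questions"])
        total += spec["weight"] * category_score
    return total, total >= PASS_THRESHOLD
```

With orthogonal questions and weights summing to 1.0, the final score stays interpretable as "fraction of quality signals satisfied, weighted by how much each category matters."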
Output: references/evaluation-rubric.md with binary questions, scoring logic, and examples.
Quality check:
- Total questions: 15-20 (not more - judge fatigue)
- Each question is truly binary (no ambiguity)
- Orthogonal (no double-counting)
- Safety gate is explicit
- Examples demonstrate the rubric works
## Phase 3b: Expert Role-Play Critique (Optional but Recommended)
Goal: Stress-test your rubric by having LLM role-play domain experts who critique your criteria.
Why this works: Different experts see different failure modes. A clinical psychologist catches different issues than a UX researcher than an AI safety researcher. Role-playing experts surfaces blind spots you'd otherwise miss.
### Steps

1. **Identify relevant expert perspectives for your domain**

   Examples by domain:
   - Therapy → Clinical psychologists, specific therapy creators (CBT, DBT), patient advocates
   - Customer support → Service design experts, accessibility advocates, angry customer personas
   - Legal → Practicing attorneys, legal ethicists, opposing counsel perspective
   - Code review → Security researchers, performance engineers, junior developer perspective
2. **For each expert, prompt Claude to role-play and critique** (a minimal automation sketch follows these steps)

   ```
   Role-play as [Expert Name/Type], a [credentials/perspective]. Critically review these evaluation criteria for [domain]. Focus on:
   - What failure modes would this rubric miss?
   - What would pass this rubric but still be harmful/inadequate?
   - What would fail this rubric but actually be appropriate?
   - Are there user populations this rubric doesn't serve?

   Be constructively critical. Suggest specific improvements.
   ```
3. **Synthesize critiques into rubric improvements**

   Common discoveries:
   - Missing criteria for specific user populations (neurodivergent, non-Western, etc.)
   - Over-strict criteria that penalize valid approaches
   - Under-strict criteria that let problems through
   - Criteria that conflict with each other
4. **Include fictional experts for edge perspectives**

   Fictional characters from your domain can surface non-obvious issues:
   - A skeptical user who's been burned before
   - A vulnerable user in crisis
   - A user from a different cultural context
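If you want to run this critique pass as a batch rather than interactively, a minimal sketch using the Anthropic Python SDK might look like the following. The personas, model name, and template wording are placeholders to adapt, not part of this skill.

```python
import anthropic

# Hypothetical expert personas; swap in whoever fits your domain.
EXPERTS = [
    "Marsha Linehan, creator of DBT",
    "a skeptical user who has been burned by AI chatbots before",
    "an accessibility advocate focused on neurodivergent users",
]

CRITIQUE_TEMPLATE = """Role-play as {expert}. Critically review these evaluation criteria for {domain}.
Focus on: missed failure modes, harmful responses that would still pass, appropriate responses that
would fail, and user populations the rubric doesn't serve. Be constructively critical and suggest
specific improvements.

RUBRIC:
{rubric}"""

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def collect_critiques(rubric_text: str, domain: str, model: str = "claude-sonnet-4-20250514"):
    """Collect one critique per expert persona; returns {persona: critique text}."""
    critiques = {}
    for expert in EXPERTS:
        prompt = CRITIQUE_TEMPLATE.format(expert=expert, domain=domain, rubric=rubric_text)
        response = client.messages.create(
            model=model,  # placeholder model name; use whichever model you critique with
            max_tokens=2000,
            messages=[{"role": "user", "content": prompt}],
        )
        critiques[expert] = response.content[0].text
    return critiques
```

Each critique still gets read manually and folded back into the rubric; the point is coverage of perspectives, not automated acceptance of every suggestion.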
### Example: Therapy Domain Expert Panel
Experts consulted:
- Marsha Linehan (DBT creator) → Caught missing dialectical synthesis in validation criteria
- William Miller (MI creator) → Identified that empowerment criteria missed solution origin
- Irvin Yalom (existential therapy) → Added presence-without-insight as valid therapeutic outcome
- Emily Bender (computational linguist) → Reframed "AI cannot feel" to "user inference" perspective
- Percy Liang (LLM evaluation) → Switched from confirm-first to evidence-first judge reasoning
Output: Updated rubric with expert-informed improvements.
When to skip: Simple domains with well-understood failure modes. But for any human-facing AI with emotional/cultural stakes, this step catches critical blind spots.
## Phase 4: Evaluation Prompt Design

Goal: Create a prompt that applies the rubric consistently to any conversation.
### Steps

1. **Structure the prompt**

   ```
   Role & Context → Rubric → Input Format → Output Format → Examples
   ```
2. **Role & Context**
   - You are evaluating [domain] responses
   - Goal: apply the binary rubric to assess quality
   - The output will be used to filter training data
3. **Present the rubric**
   - All categories and questions
   - Definitions for any ambiguous terms
   - Safety questions first (fail-fast)
4. **Define the input format**
   - User message + assistant response
   - Optional: conversation context
5. **Require structured output** (see the parsing sketch after these steps)
   - JSON with question IDs mapped to true/false
   - Example: `{"CP1": true, "CP2": false, ...}`
6. **Include calibration examples**
   - 2-3 examples with expected answers
   - Show edge cases and reasoning
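A short sketch of consuming that machine-parseable output downstream, assuming question IDs like the ones above; it pairs with the score_response sketch from Phase 3.

```python
import json

def judge_verdicts(raw_judge_output: str) -> dict[str, bool] | None:
    """Parse the judge's JSON answers (e.g. {"CP1": true, ...}); None if malformed."""
    try:
        answers = json.loads(raw_judge_output)
    except json.JSONDecodeError:
        return None  # unparseable judge output: re-run the judge or drop the example
    return {qid: bool(verdict) for qid, verdict in answers.items()}
```

Treating malformed output as a rejection is the safe default; re-running the judge once before discarding is a cheap alternative.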
Output: prompts/evaluation-prompt.md
Quality check:
- Prompt is self-contained (doesn't require reading other docs)
- Examples cover edge cases
- Output format is machine-parseable
- Safety questions come first
## Phase 5: Generation Prompt Design

Goal: Create a prompt that produces training data that passes your rubric.

Critical: The generation prompt encodes all of the domain knowledge you've extracted.
### Steps

1. **Structure the prompt**

   ```
   Role & Expertise → Framework Grounding → Behavioral Anchors → Failure Mode Guards → Few-Shot Examples → Output Format
   ```
2. **Role & Expertise**
   - Who is the model being? (therapist, lawyer, support agent)
   - What's the relationship? (coach, advisor, assistant)
   - What frameworks inform this role?
3. **Framework Grounding**
   - Reference specific techniques from domain-frameworks.md
   - When to use which technique
   - How techniques manifest in language (concrete examples)
4. **Behavioral Anchors**
   - Embed the ALWAYS DO list
   - Embed the NEVER DO list
   - Make each item explicit (not "be empathetic" but specific behaviors)
5. **Failure Mode Guards**
   - Address each documented failure mode
   - Provide specific counters ("If the user mentions X → respond with Y pattern")
   - Include crisis detection if domain-appropriate
6. **Few-Shot Examples**
   - 3-5 exemplar conversations
   - Show diversity (different techniques, situations, tones)
   - Each example should pass your rubric
   - Annotate WHY each is good
7. **Output Format** (a structure sketch follows these steps)
   - Specify the conversation structure
   - Turn-taking expectations
   - Length guidelines
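To pin down what "specify the conversation structure" can mean in practice, here is one possible shape for a generated training example, written as Python that appends to a JSONL file. The keys, roles, metadata fields, and the conversation snippet itself are illustrative assumptions, not a required schema.

```python
import json

# Hypothetical schema for one generated training example; match it to whatever
# your trainer expects (e.g. a chat-formatted messages list plus optional metadata).
example = {
    "messages": [
        {"role": "user",
         "content": "I keep putting off studying and then panic the night before."},
        {"role": "assistant",
         "content": ("That sounds exhausting - panic stacked on top of the avoidance. "
                     "Before we look at study plans, what usually goes through your mind "
                     "right when you think about starting?")},
    ],
    "metadata": {"technique": "reflective questioning", "scenario": "procrastination"},
}

with open("pilot_examples.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(example, ensure_ascii=False) + "\n")
```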
Output: prompts/generation-prompt.md
Quality check:
- Could someone unfamiliar with your domain use this prompt to generate good examples?
- Does it address all failure modes you documented?
- Are behavioral anchors embedded (not referenced)?
- Do the few-shot examples actually exemplify the quality you want?
## Workflow Integration

Once all 5 phases are complete:

1. **Pilot generation (100 examples)** - see the pilot-loop sketch after these steps
   - Use the generation prompt
   - Run the output through the evaluation prompt
   - Check the pass rate (target >70%)
2. **Manual spot-check (10 examples)**
   - Review them personally
   - Do they feel right?
   - Any failure modes missed?
3. **Iterate prompts if needed**
   - Low pass rate → strengthen the generation prompt
   - Wrong things passing → tighten the rubric
   - Systematic failures → add failure mode guards
4. **Scale generation (2K-5K examples)**
   - Only once the pilot is validated
   - Same prompts, larger volume
   - Automated filtering via the evaluation prompt
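A minimal sketch of the pilot loop, assuming the Anthropic Python SDK and reusing the judge_verdicts and score_response helpers sketched in Phases 3 and 4; the model name and prompt handling are placeholders, and it assumes the judge returns raw JSON only.

```python
import random
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-sonnet-4-20250514"  # placeholder; use whichever model you generate and judge with

def run_pilot(generation_prompt: str, evaluation_prompt: str, n: int = 100, target: float = 0.70):
    """Generate n candidate examples, judge each one, and report the pass rate."""
    kept = []
    for _ in range(n):
        gen = client.messages.create(
            model=MODEL, max_tokens=1500,
            messages=[{"role": "user", "content": generation_prompt}],
        )
        example = gen.content[0].text
        judged = client.messages.create(
            model=MODEL, max_tokens=500,
            messages=[{"role": "user", "content": f"{evaluation_prompt}\n\nCANDIDATE:\n{example}"}],
        )
        verdicts = judge_verdicts(judged.content[0].text)          # Phase 4 sketch
        if verdicts is not None and score_response(verdicts)[1]:   # Phase 3 sketch
            kept.append(example)
    rate = len(kept) / n
    print(f"Pass rate: {rate:.0%} (target {target:.0%})")
    # Manual spot-check: sample up to 10 kept examples for personal review.
    for ex in random.sample(kept, min(10, len(kept))):
        print(ex[:200], "...")
    return rate, kept
```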
## Common Mistakes to Avoid
| Mistake | Why It Fails | Fix |
|---|---|---|
| "Skip frameworks, just use GPT's knowledge" | GPT's implicit knowledge has gaps and biases | Explicit framework research grounds generation |
| "Write generation prompt first" | Can't hit a target you haven't defined | Evaluation criteria first, always |
| "Use 1-5 scales for evaluation" | LLM judges are inconsistent with Likert | Binary questions are more reliable |
| "100 questions in rubric for thoroughness" | Judge fatigue, noise, cost | 15-20 orthogonal questions maximum |
| "Behavioral anchors can be abstract" | "Be empathetic" ≠ actionable | Observable verbs only |
| "Skip failure mode research" | You'll rediscover all known problems | Literature search saves time |
| "Generate then evaluate to see what works" | Expensive iteration cycle | Design evaluation criteria first |
## File Organization

After completing this process, you should have:

```
project/
├── skills/
│   └── finetune-prep/
│       ├── SKILL.md (this file)
│       └── references/
│           ├── domain-frameworks.md
│           └── evaluation-rubric.md
├── reference/
│   ├── domain-frameworks.md (working version)
│   └── evaluation-rubric.md (working version)
└── prompts/
    ├── evaluation-prompt.md
    └── generation-prompt.md
```
## Next Steps (Outside This Skill)
After preparation is complete:
- Pilot data generation (use generation prompt)
- Evaluation and filtering (use evaluation prompt)
- Training (use existing HF trainer skill/tools)
- Post-training evaluation (same rubric on model outputs)
## Red Flags - Stop and Reconsider
These thoughts indicate you're skipping critical steps:
- "Domain frameworks are obvious, I'll skip research" → Research anyway, capture specifics
- "I'll write rubric and generation prompt together" → Evaluation first, always
- "Abstract qualities are fine for behavioral anchors" → Make them observable
- "10 questions is too few" → Fewer, better questions > many weak signals
- "I'll test prompts by generating and seeing what happens" → Define quality criteria first
- "This domain is too subjective for binary questions" → Decompose into observable components
All of these mean: Follow the process. Quality compounds.
## Key Principles

- **Evaluation before generation** - Define good before making examples
- **Explicit over implicit** - LLM knowledge is incomplete; make it explicit
- **Observable over abstract** - "Validate feelings" → "Acknowledge emotion before problem-solving"
- **Binary over scaled** - YES/NO > 1-5 for LLM judges
- **Orthogonal over redundant** - 15 distinct signals > 30 correlated ones
- **Failure-informed over theory-only** - Research what goes wrong in practice
## Time Investment
Expected time for full preparation:
- Phase 1 (Frameworks): 2-3 hours (research + synthesis)
- Phase 2 (Anchors): 1 hour (extraction + refinement)
- Phase 3 (Rubric): 2-3 hours (design + examples)
- Phase 4 (Eval prompt): 1 hour (writing + testing)
- Phase 5 (Gen prompt): 2 hours (encoding knowledge + examples)
Total: 8-10 hours of preparation
This feels like overhead but saves 20-40 hours of iteration on bad training data.
## Success Criteria

You've completed this process successfully when:

- [ ] Domain frameworks doc covers techniques with conversation examples
- [ ] Failure modes are documented with specific instances
- [ ] Behavioral anchors are observable (verb-based)
- [ ] Rubric has 15-20 orthogonal binary questions
- [ ] Rubric examples show it catches failures
- [ ] Evaluation prompt is self-contained with calibration examples
- [ ] Generation prompt embeds all domain knowledge explicitly
- [ ] Generation prompt includes failure mode guards
- [ ] Few-shot examples in generation prompt pass your rubric
- [ ] Pilot generation (100 examples) has >70% pass rate

Deploy to training pipeline only after all checkboxes pass.
This skill encodes a methodology for fine-tuning preparation developed through critical analysis of synthetic data quality, evaluation reliability, and LLM-as-expert limitations.