| name | finetune-generate |
| description | Use when generating synthetic training data for multi-turn conversation fine-tuning. Triggers - have design artifacts ready, need to generate conversations, ready to assess quality. Requires finetune-design first. |
Fine-tune Generate
Iteratively generate and filter training data until quality stabilizes.
Prerequisites
Complete finetune-design first. You need:
- Model choice and token constraints
- Input taxonomy
- Evaluation rubric with calibration examples
- Persona template
- User simulator, assistant, and system prompts
Outputs
By the end of this phase, you will have:
- training_data.jsonl — Filtered, sliced training examples
- generation_stats.md — Pass rates, criterion breakdown, iterations
- prompt_versions/ — History of prompt iterations
The Core Loop
This is the most important part of the entire pipeline.
┌─────────────────────────────────────────────────────────────┐
│ TIGHT LOOP (5 transcripts per iteration) │
│ │
│ 1. Generate 5 transcripts │
│ 2. Assess with rubric (all backends) │
│ 3. HUMAN REVIEWS both transcripts AND assessments │
│ 4. Iterate based on human judgment │
│ 5. Repeat until ≥70% pass rate AND human satisfied │
│ │
│ Then: Scale to full volume │
└─────────────────────────────────────────────────────────────┘
Why 5 Transcripts?
- Small enough for human to actually READ each one carefully
- Fast feedback (minutes, not hours)
- See patterns without wasting compute
- Iterate while context is fresh
Why Human-in-the-Loop? (Non-Negotiable)
Human review is required, not optional. The human reviews BOTH transcripts AND assessment results:
| Human reviews... | Looking for... |
|---|---|
| Transcripts | Quality issues the rubric might miss |
| Assessment results | False positives (passed but shouldn't have) |
| Assessment results | False negatives (failed but seems fine) |
| Both together | Gaps in what the rubric even checks |
Without human review:
- You're optimizing against a potentially broken metric
- False positives silently corrupt training data
- Rubric blind spots never get discovered
Red Flags: Rationalizations to Resist
| Rationalization | Reality |
|---|---|
| "Human review slows us down" | Skipping review = optimizing against broken metric. 1 hour of review saves days of bad data. |
| "Pass rate is high, must be fine" | High pass rate with single backend misses 20-30% of issues. Multi-backend + human review required. |
| "We can add calibration examples later" | Without calibration examples, backends disagree silently. Add them during design. |
| "The rubric is complete" | Rubrics evolve (e.g., 12→18 criteria). New failure modes emerge. |
| "One assessor backend is enough" | Single backend gave transcript 1000 perfect 1.0; other backends caught 4 failures. |
| "Let's just scale and filter later" | Scaling before 70% pass rate wastes compute. Fix prompts first. |
If you catch yourself using any of these rationalizations: STOP. Follow the gates.
Dual Iteration
You iterate on TWO things, not one:
| When you see... | Iterate on... |
|---|---|
| Transcript quality issues | Generation prompts (user-sim, assistant) |
| Assessment seems wrong | Assessor prompt, criteria wording |
| Backend disagreement | Calibration examples for that criterion |
| Missing failure mode | Add new criterion to rubric |
| Pass rates high but something feels off | Run expert role-play critique |
The rubric is never "done." In one project, criteria evolved: 12 → 14 → 16 → 17 → 18.
Expert role-play critique is required — periodically have Claude role-play domain experts to critique your rubric and small transcript batch directly. This catches blind spots invisible from your own perspective. See assessment-guide.md#expert-role-play-critique.
Workflow
Step 1: Tight Iteration Loop
For each batch of 5 transcripts:
- Generate 5 transcripts using two-agent simulation
- Assess with rubric using multiple backends (Claude, Gemini, GPT-5)
- Human reviews both transcripts and assessments:
- Read each transcript: Is this actually good?
- Read each assessment: Did the rubric catch what matters?
- Note: false positives, false negatives, missing criteria
- Iterate based on human judgment:
- Fix generation prompts (if transcript quality issues)
- Fix assessor prompt/criteria (if assessment issues)
- Add calibration examples (if edge cases found)
- Repeat until quality stabilizes
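A minimal sketch of one pass through this loop. The helper names (sample_personas, generate_transcript, assess_transcript, render_for_review) and the shape of the assessment dict are assumptions for illustration, not the actual script APIs:

```python
# Illustrative only: helper names and assessment dict keys are assumptions.
BACKENDS = ["claude", "gemini", "gpt-5"]
BATCH_SIZE = 5

def run_tight_loop_iteration(personas, rubric):
    batch = []
    for persona in sample_personas(personas, k=BATCH_SIZE):
        transcript = generate_transcript(persona)                # two-agent simulation
        assessments = {
            backend: assess_transcript(transcript, rubric, backend=backend)
            for backend in BACKENDS                              # multi-backend rubric pass
        }
        batch.append({"transcript": transcript, "assessments": assessments})

    # Surface BOTH transcripts and assessments for the human reviewer.
    for item in batch:
        print(render_for_review(item))

    passed = sum(
        all(a["passed"] for a in item["assessments"].values()) for item in batch
    )
    return batch, passed / len(batch)   # pass rate feeds the gate below
```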
Gate (before scaling):
| Condition | Action |
|---|---|
| ≥70% pass rate AND human satisfied | Proceed to scale |
| 50-70% OR human sees issues | Continue iterating |
| <50% | Major revision needed |
Reference: generation-guide.md, assessment-guide.md
Step 2: Scale Generation
Once the tight loop stabilizes:
- Generate target volume (100+ transcripts)
- Continue assessment with same multi-backend approach
- Periodic human spot-checks (every 20-50 transcripts)
- Track statistics (pass rate, criterion breakdown)
Warning signs during scale:
- Pass rate drifting down → Revisit prompts
- New failure patterns emerging → Add criteria
- Perfect scores (1.0) → Suspiciously high, investigate
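A rough sketch of catching the first warning sign (pass rate drifting down) by comparing a rolling window against the cumulative rate. The window size and drift threshold here are assumptions, not prescribed values:

```python
from collections import deque

class PassRateMonitor:
    """Rolling vs. cumulative pass rate, to catch downward drift during scale-up."""

    def __init__(self, window: int = 20, drift_threshold: float = 0.10):
        self.recent = deque(maxlen=window)   # most recent pass/fail outcomes
        self.drift_threshold = drift_threshold
        self.passed = 0
        self.total = 0

    def record(self, passed: bool) -> None:
        self.recent.append(passed)
        self.passed += int(passed)
        self.total += 1

    def drifting(self) -> bool:
        # Only meaningful once the rolling window is full.
        if len(self.recent) < self.recent.maxlen:
            return False
        rolling = sum(self.recent) / len(self.recent)
        cumulative = self.passed / self.total
        return (cumulative - rolling) > self.drift_threshold
```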
Step 3: Audit Patterns
Run quantitative analysis on the full dataset to catch issues invisible in spot-checks:
| Check | Red Flag | Action |
|---|---|---|
| Phrase repetition | Any phrase in >50% of responses | Add to anti-patterns, regenerate |
| Structural rigidity | 100% same format | Vary response structure |
| Response length ratio | Avg >2x user length | Tighten length constraints |
| Praise distribution | Late responses 2x more praise | Adjust tone consistency |
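A sketch of two of these checks, phrase repetition and response-length ratio. The thresholds mirror the table above; the transcript shape (a list of {"role", "content"} turns) is an assumption:

```python
from collections import Counter

def phrase_repetition(responses, n=3):
    """Share of assistant responses containing the single most common n-gram."""
    counts = Counter()
    for text in responses:
        words = text.lower().split()
        # A set per response, so each phrase counts at most once per response.
        counts.update({" ".join(words[i:i + n]) for i in range(len(words) - n + 1)})
    if not counts:
        return 0.0, None
    phrase, hits = counts.most_common(1)[0]
    return hits / len(responses), phrase      # red flag above 0.5

def length_ratio(transcript):
    """Average assistant response length over average user message length."""
    user = [len(t["content"]) for t in transcript if t["role"] == "user"]
    asst = [len(t["content"]) for t in transcript if t["role"] == "assistant"]
    return (sum(asst) / len(asst)) / (sum(user) / len(user))   # red flag above 2.0
```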
Gate: No audit red flags
Reference: assessment-guide.md#audit-patterns
Step 4: Fixup or Reject
For failing transcripts, decide whether to fix or reject:
| Failure Type | Action |
|---|---|
| Soft failures (language, tone) | Attempt fixup with entailment constraint |
| Safety gate failures | Truncate at failure point or reject entirely |
| Structural issues | Usually reject |
Entailment constraint: The fixed response must still lead naturally into the user's next message. If the fix breaks continuity → truncate instead.
If >30% need fixup: Generation prompts need revision.
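A sketch of this triage decision. The assessment field names and the fixer/continuity helpers are hypothetical placeholders, not part of the actual assess.py output:

```python
def triage(transcript, assessment):
    """Route a failing transcript to fixup, truncate, or reject (illustrative only)."""
    if assessment["safety_gate_failed"]:
        cut = assessment["first_failure_turn"]
        return "truncate", transcript[:cut]               # or reject entirely
    if assessment["failure_type"] == "structural":
        return "reject", None
    # Soft failure (language, tone): attempt a fixup under the entailment constraint.
    fixed = rewrite_flagged_responses(transcript, assessment)   # hypothetical fixer
    if preserves_continuity(fixed, transcript):                 # hypothetical entailment check
        return "fixup", fixed
    return "truncate", transcript[:assessment["first_failure_turn"]]
```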
Reference: assessment-guide.md#fixup-strategy
Step 5: Slice for Training
Create training examples from full transcripts:
50-turn transcript → ~8-10 training examples via slicing
Slicing strategy:
- Random slice points (seeded by transcript ID for reproducibility)
- Minimum 3 exchanges before first slice
- 2-5 exchange gaps between slices
- Always include final turn
Token validation:
- Each slice must be under your token limit (e.g., 16K)
- Long transcripts may need truncation
Leakage prevention:
- Split by transcript/persona FIRST
- Then slice within each split
- Never let slices from the same transcript appear in both train and validation
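A sketch of the slicing and splitting logic above. The exchange counts match the strategy listed; the function names and the 10% validation fraction are assumptions:

```python
import random

def slice_points(transcript_id: str, n_exchanges: int) -> list[int]:
    """Slice end-points: >=3 exchanges before the first slice, 2-5 exchange gaps,
    and always a slice at the final exchange."""
    rng = random.Random(transcript_id)   # seeded by transcript ID for reproducibility
    points, cursor = [], 3
    while cursor < n_exchanges:
        points.append(cursor)
        cursor += rng.randint(2, 5)
    if not points or points[-1] != n_exchanges:
        points.append(n_exchanges)       # always include the final turn
    return points

def leakage_safe_split(transcript_ids: list[str], val_fraction: float = 0.1):
    """Assign whole transcripts to train or validation BEFORE slicing."""
    rng = random.Random(0)
    ids = sorted(transcript_ids)
    rng.shuffle(ids)
    cut = int(len(ids) * (1 - val_fraction))
    return set(ids[:cut]), set(ids[cut:])
```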
Reference: assessment-guide.md#slicing-strategy
Optional: Use the hugging-face-dataset-creator skill when ready to push training_data.jsonl to the HF Hub.
Infrastructure
Checkpointing
Write progress after each transcript, not at the end:
```python
# Checkpoint as you go: a crash or rate-limit error then costs at most one transcript.
for persona in personas:
    transcript = generate_transcript(persona)
    save_immediately(transcript)  # Don't batch
```
Retry with Backoff
API failures will happen. Use exponential backoff:
- Claude: 7 attempts, 1-hour max wait
- Google: Extract retry delay from error message
- OpenAI: Standard exponential backoff
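A minimal generic backoff wrapper as a sketch. The per-provider specifics (such as extracting Google's suggested retry delay from the error) are left out, and the defaults mirror the Claude settings above:

```python
import random
import time

def with_backoff(call, max_attempts=7, base_delay=1.0, max_delay=3600.0):
    """Retry `call` with exponential backoff plus jitter; re-raise on final failure."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            delay = min(base_delay * 2 ** attempt, max_delay)   # cap at 1 hour
            time.sleep(delay + random.uniform(0, 1))            # jitter avoids synchronized retries
```

Used as, for example, with_backoff(lambda: generate_transcript(persona)).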
Progress Tracking
Track throughout generation:
- Transcripts generated / target
- Transcripts assessed / generated
- Pass rate (rolling and cumulative)
- Criterion failure breakdown
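These counters are easy to keep in a small stats object dumped to generation_stats.md after each batch. A sketch, with the assessment field names assumed:

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class GenerationStats:
    """Running counters written out to generation_stats.md after each batch."""
    target: int
    generated: int = 0
    assessed: int = 0
    passed: int = 0
    criterion_failures: Counter = field(default_factory=Counter)

    def record(self, assessment: dict) -> None:
        self.assessed += 1
        if assessment["passed"]:
            self.passed += 1
        else:
            self.criterion_failures.update(assessment["failed_criteria"])

    @property
    def pass_rate(self) -> float:
        return self.passed / self.assessed if self.assessed else 0.0
```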
Reference: assessment-guide.md#infrastructure
Resources
| Resource | What It Contains |
|---|---|
| code/SETUP-REFERENCE.md | Script templates: generate.py, assess.py, slice.py |
| code/infrastructure.py | Copy-paste ready: LLM backend, retry strategies, checkpointing |
| examples/therapy-domain.md | Complete therapy example: prompts, flaw patterns, criteria |
Done When
- Target training example count reached
- Pass rate stable across last 2-3 batches (≥70%)
- Human satisfied with transcript quality
- Audit patterns within thresholds
- training_data.jsonl validated