| name | finetune-generate |
| description | Use when generating synthetic training data for multi-turn conversation fine-tuning. Triggers - have design artifacts ready, need to generate conversations, ready to assess quality. Requires finetune-design first. |
Fine-tune Generate
Iteratively generate and filter training data until quality stabilizes.
Prerequisites
Complete finetune-design first. You need:
- Model choice and token constraints
- Input taxonomy
- Evaluation rubric with calibration examples
- Persona template
- User simulator, assistant, and system prompts
Outputs
By the end of this phase, you will have:
- training_data.jsonl — Filtered, sliced training examples
- generation_stats.md — Pass rates, criterion breakdown, iterations
- prompt_versions/ — History of prompt iterations
The Core Loop
This is the most important part of the entire pipeline.
┌─────────────────────────────────────────────────────────────┐
│ TIGHT LOOP (5 transcripts per iteration) │
│ │
│ 1. Generate 5 transcripts │
│ 2. Assess with rubric (all backends) │
│ 3. HUMAN REVIEWS both transcripts AND assessments │
│ 4. Iterate based on human judgment │
│ 5. Repeat until ≥70% pass rate AND human satisfied │
│ │
│ Then: Scale to full volume │
└─────────────────────────────────────────────────────────────┘
Why 5 Transcripts?
- Small enough for human to actually READ each one carefully
- Fast feedback (minutes, not hours)
- See patterns without wasting compute
- Iterate while context is fresh
Why Human-in-the-Loop? (Non-Negotiable)
Human review is required, not optional. The human reviews BOTH transcripts AND assessment results:
| Human reviews... | Looking for... |
|---|---|
| Transcripts | Quality issues the rubric might miss |
| Assessment results | False positives (passed but shouldn't have) |
| Assessment results | False negatives (failed but seems fine) |
| Both together | Gaps in what the rubric even checks |
Without human review:
- You're optimizing against a potentially broken metric
- False positives silently corrupt training data
- Rubric blind spots never get discovered
Red Flags: Rationalizations to Resist
| Rationalization | Reality |
|---|---|
| "Human review slows us down" | Skipping review = optimizing against broken metric. 1 hour of review saves days of bad data. |
| "Pass rate is high, must be fine" | High pass rate with single backend misses 20-30% of issues. Multi-backend + human review required. |
| "We can add calibration examples later" | Without calibration examples, backends disagree silently. Add them during design. |
| "The rubric is complete" | Rubrics evolve (e.g., 12→18 criteria). New failure modes emerge. |
| "One assessor backend is enough" | Single backend gave transcript 1000 perfect 1.0; other backends caught 4 failures. |
| "Let's just scale and filter later" | Scaling before 70% pass rate wastes compute. Fix prompts first. |
If you catch yourself using any of these rationalizations: STOP. Follow the gates.
Dual Iteration
You iterate on TWO things, not one:
| When you see... | Iterate on... |
|---|---|
| Transcript quality issues | Generation prompts (user-sim, assistant) |
| Assessment seems wrong | Assessor prompt, criteria wording |
| Backend disagreement | Calibration examples for that criterion |
| Missing failure mode | Add new criterion to rubric |
| Pass rates high but something feels off | Run expert role-play critique |
The rubric is never "done." In one project, criteria evolved: 12 → 14 → 16 → 17 → 18.
Expert role-play critique is required — periodically have Claude role-play domain experts to critique your rubric and small transcript batch directly. This catches blind spots invisible from your own perspective. See assessment-guide.md#expert-role-play-critique.
Workflow
Step 1: Tight Iteration Loop
For each batch of 5 transcripts:
- Generate 5 transcripts using two-agent simulation
- Assess with rubric using multiple backends (Claude, Gemini, GPT-5)
- Human reviews both transcripts and assessments:
- Read each transcript: Is this actually good?
- Read each assessment: Did the rubric catch what matters?
- Note: false positives, false negatives, missing criteria
- Iterate based on human judgment:
- Fix generation prompts (if transcript quality issues)
- Fix assessor prompt/criteria (if assessment issues)
- Add calibration examples (if edge cases found)
- Repeat until quality stabilizes
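A minimal sketch of one pass through this loop. The helper names (sample_personas, generate_transcript, assess_transcript, render_for_review) and the shape of the assessment dict are assumptions for illustration, not the actual script APIs:

```python
# Illustrative only: helper names and assessment dict keys are assumptions.
BACKENDS = ["claude", "gemini", "gpt-5"]
BATCH_SIZE = 5

def run_tight_loop_iteration(personas, rubric):
    batch = []
    for persona in sample_personas(personas, k=BATCH_SIZE):
        transcript = generate_transcript(persona)                # two-agent simulation
        assessments = {
            backend: assess_transcript(transcript, rubric, backend=backend)
            for backend in BACKENDS                              # multi-backend rubric pass
        }
        batch.append({"transcript": transcript, "assessments": assessments})

    # Surface BOTH transcripts and assessments for the human reviewer.
    for item in batch:
        print(render_for_review(item))

    passed = sum(
        all(a["passed"] for a in item["assessments"].values()) for item in batch
    )
    return batch, passed / len(batch)   # pass rate feeds the gate below
```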
Gate (before scaling):
| Condition | Action |
|---|---|
| ≥70% pass rate AND human satisfied | Proceed to scale |
| 50-70% OR human sees issues | Continue iterating |
| <50% | Major revision needed |
Reference: generation-guide.md, assessment-guide.md
Step 2: Scale Generation
Once the tight loop stabilizes:
- Generate target volume (100+ transcripts)
- Continue assessment with same multi-backend approach
- Periodic human spot-checks (every 20-50 transcripts)
- Track statistics (pass rate, criterion breakdown)
Warning signs during scale:
- Pass rate drifting down → Revisit prompts
- New failure patterns emerging → Add criteria
- Perfect scores (1.0) → Suspiciously high, investigate
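A rough sketch of catching the first warning sign (pass rate drifting down) by comparing a rolling window against the cumulative rate. The window size and drift threshold here are assumptions, not prescribed values:

```python
from collections import deque

class PassRateMonitor:
    """Rolling vs. cumulative pass rate, to catch downward drift during scale-up."""

    def __init__(self, window: int = 20, drift_threshold: float = 0.10):
        self.recent = deque(maxlen=window)   # most recent pass/fail outcomes
        self.drift_threshold = drift_threshold
        self.passed = 0
        self.total = 0

    def record(self, passed: bool) -> None:
        self.recent.append(passed)
        self.passed += int(passed)
        self.total += 1

    def drifting(self) -> bool:
        # Only meaningful once the rolling window is full.
        if len(self.recent) < self.recent.maxlen:
            return False
        rolling = sum(self.recent) / len(self.recent)
        cumulative = self.passed / self.total
        return (cumulative - rolling) > self.drift_threshold
```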
Step 3: Audit Patterns
Run quantitative analysis on the full dataset to catch issues invisible in spot-checks:
| Check | Red Flag | Action |
|---|---|---|
| Phrase repetition | Any phrase in >50% of responses | Add to anti-patterns, regenerate |
| Structural rigidity | 100% same format | Vary response structure |
| Response length ratio | Avg >2x user length | Tighten length constraints |
| Praise distribution | Late responses 2x more praise | Adjust tone consistency |
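A sketch of two of these checks, phrase repetition and response-length ratio. The thresholds mirror the table above; the transcript shape (a list of {"role", "content"} turns) is an assumption:

```python
from collections import Counter

def phrase_repetition(responses, n=3):
    """Share of assistant responses containing the single most common n-gram."""
    counts = Counter()
    for text in responses:
        words = text.lower().split()
        # A set per response, so each phrase counts at most once per response.
        counts.update({" ".join(words[i:i + n]) for i in range(len(words) - n + 1)})
    if not counts:
        return 0.0, None
    phrase, hits = counts.most_common(1)[0]
    return hits / len(responses), phrase      # red flag above 0.5

def length_ratio(transcript):
    """Average assistant response length over average user message length."""
    user = [len(t["content"]) for t in transcript if t["role"] == "user"]
    asst = [len(t["content"]) for t in transcript if t["role"] == "assistant"]
    return (sum(asst) / len(asst)) / (sum(user) / len(user))   # red flag above 2.0
```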
Gate: No audit red flags
Reference: assessment-guide.md#audit-patterns
Step 4: Fixup or Reject
For failing transcripts, decide whether to fix or reject:
| Failure Type | Action |
|---|---|
| Soft failures (language, tone) | Attempt fixup with entailment constraint |
| Safety gate failures | Truncate at failure point or reject entirely |
| Structural issues | Usually reject |
Entailment constraint: The fixed response must still lead naturally into the user's next message. If the fix breaks continuity → truncate instead.
If >30% need fixup: Generation prompts need revision.
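A sketch of this triage decision. The assessment field names and the fixer/continuity helpers are hypothetical placeholders, not part of the actual assess.py output:

```python
def triage(transcript, assessment):
    """Route a failing transcript to fixup, truncate, or reject (illustrative only)."""
    if assessment["safety_gate_failed"]:
        cut = assessment["first_failure_turn"]
        return "truncate", transcript[:cut]               # or reject entirely
    if assessment["failure_type"] == "structural":
        return "reject", None
    # Soft failure (language, tone): attempt a fixup under the entailment constraint.
    fixed = rewrite_flagged_responses(transcript, assessment)   # hypothetical fixer
    if preserves_continuity(fixed, transcript):                 # hypothetical entailment check
        return "fixup", fixed
    return "truncate", transcript[:assessment["first_failure_turn"]]
```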
Reference: assessment-guide.md#fixup-strategy
Step 5: Slice for Training
Create training examples from full transcripts:
50-turn transcript → ~8-10 training examples via slicing
Slicing strategy:
- Random slice points (seeded by transcript ID for reproducibility)
- Minimum 3 exchanges before first slice
- 2-5 exchange gaps between slices
- Always include final turn
Token validation:
- Each slice must be under your token limit (e.g., 16K)
- Long transcripts may need truncation
Leakage prevention:
- Split by transcript/persona FIRST
- Then slice within each split
- Never let slices from the same transcript appear in both train and validation
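A sketch of the slicing and splitting logic above. The exchange counts match the strategy listed; the function names and the 10% validation fraction are assumptions:

```python
import random

def slice_points(transcript_id: str, n_exchanges: int) -> list[int]:
    """Slice end-points: >=3 exchanges before the first slice, 2-5 exchange gaps,
    and always a slice at the final exchange."""
    rng = random.Random(transcript_id)   # seeded by transcript ID for reproducibility
    points, cursor = [], 3
    while cursor < n_exchanges:
        points.append(cursor)
        cursor += rng.randint(2, 5)
    if not points or points[-1] != n_exchanges:
        points.append(n_exchanges)       # always include the final turn
    return points

def leakage_safe_split(transcript_ids: list[str], val_fraction: float = 0.1):
    """Assign whole transcripts to train or validation BEFORE slicing."""
    rng = random.Random(0)
    ids = sorted(transcript_ids)
    rng.shuffle(ids)
    cut = int(len(ids) * (1 - val_fraction))
    return set(ids[:cut]), set(ids[cut:])
```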
Reference: assessment-guide.md#slicing-strategy
Optional: Use the hugging-face-dataset-creator skill when ready to push training_data.jsonl to the HF Hub.
Infrastructure
Checkpointing
Write progress after each transcript, not at the end:
```python
# Checkpoint as you go: a crash or rate-limit error then costs at most one transcript.
for persona in personas:
    transcript = generate_transcript(persona)
    save_immediately(transcript)  # Don't batch
```
Retry with Backoff
API failures will happen. Use exponential backoff:
- Claude: 7 attempts, 1-hour max wait
- Google: Extract retry delay from error message
- OpenAI: Standard exponential backoff
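A minimal generic backoff wrapper as a sketch. The per-provider specifics (such as extracting Google's suggested retry delay from the error) are left out, and the defaults mirror the Claude settings above:

```python
import random
import time

def with_backoff(call, max_attempts=7, base_delay=1.0, max_delay=3600.0):
    """Retry `call` with exponential backoff plus jitter; re-raise on final failure."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            delay = min(base_delay * 2 ** attempt, max_delay)   # cap at 1 hour
            time.sleep(delay + random.uniform(0, 1))            # jitter avoids synchronized retries
```

Used as, for example, with_backoff(lambda: generate_transcript(persona)).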
Progress Tracking
Track throughout generation:
- Transcripts generated / target
- Transcripts assessed / generated
- Pass rate (rolling and cumulative)
- Criterion failure breakdown
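These counters are easy to keep in a small stats object dumped to generation_stats.md after each batch. A sketch, with the assessment field names assumed:

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class GenerationStats:
    """Running counters written out to generation_stats.md after each batch."""
    target: int
    generated: int = 0
    assessed: int = 0
    passed: int = 0
    criterion_failures: Counter = field(default_factory=Counter)

    def record(self, assessment: dict) -> None:
        self.assessed += 1
        if assessment["passed"]:
            self.passed += 1
        else:
            self.criterion_failures.update(assessment["failed_criteria"])

    @property
    def pass_rate(self) -> float:
        return self.passed / self.assessed if self.assessed else 0.0
```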
Reference: assessment-guide.md#infrastructure
Resources
| Resource | What It Contains |
|---|---|
| code/SETUP-REFERENCE.md | Script templates: generate.py, assess.py, slice.py |
| code/infrastructure.py | Copy-paste ready: LLM backend, retry strategies, checkpointing |
| examples/therapy-domain.md | Complete therapy example: prompts, flaw patterns, criteria |
Done When
- Target training example count reached
- Pass rate stable across last 2-3 batches (≥70%)
- Human satisfied with transcript quality
- Audit patterns within thresholds
- training_data.jsonl validated