

SKILL.md

---
name: finetune-generate
description: Use when generating synthetic training data for multi-turn conversation fine-tuning. Triggers - have design artifacts ready, need to generate conversations, ready to assess quality. Requires finetune-design first.
---

Fine-tune Generate

Iteratively generate and filter training data until quality stabilizes.

Prerequisites

Complete finetune-design first. You need:

  • Model choice and token constraints
  • Input taxonomy
  • Evaluation rubric with calibration examples
  • Persona template
  • User simulator, assistant, and system prompts

Outputs

By the end of this phase, you will have:

  • training_data.jsonl — Filtered, sliced training examples
  • generation_stats.md — Pass rates, criterion breakdown, iterations
  • prompt_versions/ — History of prompt iterations

The Core Loop

This is the most important part of the entire pipeline.

┌─────────────────────────────────────────────────────────────┐
│  TIGHT LOOP (5 transcripts per iteration)                   │
│                                                             │
│  1. Generate 5 transcripts                                  │
│  2. Assess with rubric (all backends)                       │
│  3. HUMAN REVIEWS both transcripts AND assessments          │
│  4. Iterate based on human judgment                         │
│  5. Repeat until ≥70% pass rate AND human satisfied         │
│                                                             │
│  Then: Scale to full volume                                 │
└─────────────────────────────────────────────────────────────┘
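A minimal sketch of this loop as a driver, assuming `generate` is your two-agent simulator, `assess` is your multi-backend assessor, and assessments come back as dicts with a `passed` flag (all assumptions; adapt the names to your own scripts):

```python
def run_tight_loop(generate, assess, personas, batch_size=5, gate=0.70):
    """Drive the tight loop. `generate` (two-agent simulation) and `assess`
    (multi-backend rubric scoring) are your own code -- placeholders here."""
    while True:
        batch = [generate(p) for p in personas[:batch_size]]
        results = [assess(t) for t in batch]
        pass_rate = sum(r["passed"] for r in results) / len(results)

        # Step 3 cannot be automated: a human reads every transcript AND every
        # assessment before deciding what, if anything, to change.
        print(f"Batch pass rate: {pass_rate:.0%} -- review transcripts and assessments now")
        human_satisfied = input("Human satisfied with this batch? [y/N] ").strip().lower() == "y"

        if pass_rate >= gate and human_satisfied:
            return  # stabilized -- scale to full volume
        # Otherwise iterate on generation prompts, assessor prompt, or rubric, then rerun.
```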

Why 5 Transcripts?

  • Small enough for a human to actually READ each one carefully
  • Fast feedback (minutes, not hours)
  • See patterns without wasting compute
  • Iterate while context is fresh

Why Human-in-the-Loop? (Non-Negotiable)

Human review is required, not optional. The human reviews BOTH transcripts AND assessment results:

| Human reviews... | Looking for... |
| --- | --- |
| Transcripts | Quality issues the rubric might miss |
| Assessment results | False positives (passed but shouldn't have) |
| Assessment results | False negatives (failed but seems fine) |
| Both together | Gaps in what the rubric even checks |

Without human review:

  • You're optimizing against a potentially broken metric
  • False positives silently corrupt training data
  • Rubric blind spots never get discovered

Red Flags: Rationalizations to Resist

| Rationalization | Reality |
| --- | --- |
| "Human review slows us down" | Skipping review means optimizing against a broken metric. One hour of review saves days of bad data. |
| "Pass rate is high, must be fine" | A high pass rate from a single backend misses 20-30% of issues. Multi-backend assessment plus human review is required. |
| "We can add calibration examples later" | Without calibration examples, backends disagree silently. Add them during design. |
| "The rubric is complete" | Rubrics evolve (e.g., 12→18 criteria). New failure modes emerge. |
| "One assessor backend is enough" | A single backend gave transcript 1000 a perfect 1.0; other backends caught 4 failures. |
| "Let's just scale and filter later" | Scaling before a 70% pass rate wastes compute. Fix the prompts first. |

If you catch yourself using any of these rationalizations: STOP. Follow the gates.

Dual Iteration

You iterate on TWO things, not one:

| When you see... | Iterate on... |
| --- | --- |
| Transcript quality issues | Generation prompts (user-sim, assistant) |
| Assessment seems wrong | Assessor prompt, criteria wording |
| Backend disagreement | Calibration examples for that criterion |
| Missing failure mode | Add new criterion to rubric |
| Pass rates high but something feels off | Run expert role-play critique |

The rubric is never "done." In one project, criteria evolved: 12 → 14 → 16 → 17 → 18.

Expert role-play critique is required: periodically have Claude role-play domain experts to critique your rubric and a small batch of transcripts directly. This catches blind spots that are invisible from your own perspective. See assessment-guide.md#expert-role-play-critique.


Workflow

Step 1: Tight Iteration Loop

For each batch of 5 transcripts:

  1. Generate 5 transcripts using two-agent simulation
  2. Assess with rubric using multiple backends (Claude, Gemini, GPT-5)
  3. Human reviews both transcripts and assessments:
    • Read each transcript: Is this actually good?
    • Read each assessment: Did the rubric catch what matters?
    • Note: false positives, false negatives, missing criteria
  4. Iterate based on human judgment:
    • Fix generation prompts (if transcript quality issues)
    • Fix assessor prompt/criteria (if assessment issues)
    • Add calibration examples (if edge cases found)
  5. Repeat until quality stabilizes

Gate (before scaling):

| Condition | Action |
| --- | --- |
| ≥70% pass rate AND human satisfied | Proceed to scale |
| 50-70% OR human sees issues | Continue iterating |
| <50% | Major revision needed |
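The gate expressed as a small helper, with thresholds taken directly from the table above (a sketch, not part of the shipped scripts):

```python
def scaling_gate(pass_rate: float, human_satisfied: bool) -> str:
    """Thresholds copied from the gate table."""
    if pass_rate >= 0.70 and human_satisfied:
        return "scale"           # proceed to full-volume generation
    if pass_rate < 0.50:
        return "major-revision"  # rework prompts/rubric, not just tweaks
    return "iterate"             # 50-70%, or the human still sees issues
```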

Reference: generation-guide.md, assessment-guide.md


Step 2: Scale Generation

Once the tight loop stabilizes:

  1. Generate target volume (100+ transcripts)
  2. Continue assessment with same multi-backend approach
  3. Periodic human spot-checks (every 20-50 transcripts)
  4. Track statistics (pass rate, criterion breakdown)

Warning signs during scale:

  • Pass rate drifting down → Revisit prompts
  • New failure patterns emerging → Add criteria
  • Perfect scores (1.0) → Suspiciously high, investigate
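One way to watch for these signs automatically, sketched below; the window size and the 10-point drift threshold are illustrative choices, not prescribed by this skill:

```python
from collections import deque

class ScaleMonitor:
    """Sketch of a drift watcher for the scale phase. Window size and
    drift threshold are illustrative choices."""
    def __init__(self, window=20, drift_threshold=0.10):
        self.recent = deque(maxlen=window)
        self.history = []
        self.drift_threshold = drift_threshold

    def record(self, passed: bool, score: float):
        self.recent.append(passed)
        self.history.append(passed)
        rolling = sum(self.recent) / len(self.recent)
        cumulative = sum(self.history) / len(self.history)
        if cumulative - rolling > self.drift_threshold:
            print(f"WARNING: pass rate drifting down "
                  f"({rolling:.0%} rolling vs {cumulative:.0%} cumulative) -- revisit prompts")
        if score >= 1.0:
            print("WARNING: perfect 1.0 score -- suspiciously high, investigate")
```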

Step 3: Audit Patterns

Run quantitative analysis on the full dataset to catch issues invisible in spot-checks:

| Check | Red Flag | Action |
| --- | --- | --- |
| Phrase repetition | Any phrase in >50% of responses | Add to anti-patterns, regenerate |
| Structural rigidity | 100% same format | Vary response structure |
| Response length ratio | Avg >2x user length | Tighten length constraints |
| Praise distribution | Late responses 2x more praise | Adjust tone consistency |
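A sketch of how the first three checks could be computed, assuming each transcript is a list of `{"role": ..., "content": ...}` turns; the n-gram size and the opener-based rigidity proxy are assumptions, and the praise-distribution check is omitted because it needs a domain-specific keyword list:

```python
from collections import Counter

def audit(transcripts, ngram=5):
    """Sketch of the pattern audit. Assumes each transcript is a list of
    {"role": "user"|"assistant", "content": str} turns."""
    assistant = [t["content"] for tr in transcripts for t in tr if t["role"] == "assistant"]
    user = [t["content"] for tr in transcripts for t in tr if t["role"] == "user"]

    # Phrase repetition: any n-gram that appears in >50% of assistant responses.
    def ngrams(text):
        words = text.lower().split()
        return {" ".join(words[i:i + ngram]) for i in range(len(words) - ngram + 1)}

    counts = Counter(g for resp in assistant for g in ngrams(resp))
    repeated = {g: c for g, c in counts.items() if c > 0.5 * len(assistant)}

    # Response length ratio: average assistant length vs average user length.
    length_ratio = (sum(map(len, assistant)) / len(assistant)) / (sum(map(len, user)) / len(user))

    # Structural rigidity: share of responses that open the same way (crude format proxy).
    openers = Counter(resp.strip()[:40] for resp in assistant)
    rigidity = openers.most_common(1)[0][1] / len(assistant)

    return {"repeated_phrases": repeated, "length_ratio": length_ratio, "rigidity": rigidity}
```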

Gate: No audit red flags

Reference: assessment-guide.md#audit-patterns


Step 4: Fixup or Reject

For failing transcripts, decide whether to fix or reject:

| Failure Type | Action |
| --- | --- |
| Soft failures (language, tone) | Attempt fixup with entailment constraint |
| Safety gate failures | Truncate at failure point or reject entirely |
| Structural issues | Usually reject |

Entailment constraint: the fixed response must naturally lead to the user's next message. If the fix breaks continuity, truncate instead.

If >30% of transcripts need fixup, the generation prompts need revision.
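The decision logic as a sketch; the failure-type tags, `attempt_fixup`, and `entails_next_user_turn` are placeholders for your own assessor output and fixup code:

```python
def triage(transcript, failures, attempt_fixup, entails_next_user_turn):
    """Sketch of the fixup/reject decision. `failures` is a list of {"type": ...}
    dicts; the two callables are your own code -- placeholders."""
    if any(f["type"] == "structural" for f in failures):
        return "reject"                  # structural issues: usually reject
    if any(f["type"] == "safety_gate" for f in failures):
        return "truncate_or_reject"      # cut at the failure point, or drop entirely
    # Soft failures (language, tone): try a fixup, but keep it only if the fixed
    # response still naturally leads into the user's next message.
    fixed = attempt_fixup(transcript, failures)
    return "fixed" if entails_next_user_turn(fixed) else "truncate"
```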

Reference: assessment-guide.md#fixup-strategy


Step 5: Slice for Training

Create training examples from full transcripts:

50-turn transcript → ~8-10 training examples via slicing

Slicing strategy:

  • Random slice points (seeded by transcript ID for reproducibility)
  • Minimum 3 exchanges before first slice
  • 2-5 exchange gaps between slices
  • Always include final turn

Token validation:

  • Each slice must be under your token limit (e.g., 16K)
  • Long transcripts may need truncation

Leakage prevention:

  • Split by transcript/persona FIRST
  • Then slice within each split
  • Never let slices from same transcript in both train and validation
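A sketch of this strategy, assuming `exchanges` is a list of (user, assistant) pairs and `count_tokens` is your tokenizer; split transcripts into train and validation before calling anything like this, so slices from one transcript never cross splits:

```python
import random

def slice_transcript(transcript_id, exchanges, count_tokens, max_tokens=16_000,
                     min_lead_in=3, min_gap=2, max_gap=5):
    """Sketch of the slicing rules above. `exchanges` is a list of
    (user_msg, assistant_msg) pairs; `count_tokens` is your tokenizer."""
    rng = random.Random(transcript_id)      # seeded by transcript ID for reproducibility
    slice_points, i = [], min_lead_in       # at least 3 exchanges before the first slice
    while i <= len(exchanges):
        slice_points.append(i)
        i += rng.randint(min_gap, max_gap)  # 2-5 exchange gaps between slices
    if not slice_points or slice_points[-1] != len(exchanges):
        slice_points.append(len(exchanges)) # always include the final turn

    examples = []
    for point in slice_points:
        example = exchanges[:point]
        # One possible truncation for over-long slices: drop the oldest exchanges.
        while count_tokens(example) > max_tokens and len(example) > 1:
            example = example[1:]
        examples.append(example)
    return examples
```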

Reference: assessment-guide.md#slicing-strategy

Optional: use the hugging-face-dataset-creator skill when you're ready to push training_data.jsonl to the HF Hub.


Infrastructure

Checkpointing

Write progress after each transcript, not at the end:

```python
for persona in personas:
    transcript = generate_transcript(persona)
    save_immediately(transcript)  # Don't batch
```
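One way `save_immediately` could work (an assumption, not the version in code/infrastructure.py): append-only JSONL, flushed per transcript.

```python
import json

def save_immediately(transcript, path="transcripts.jsonl"):
    """Append one JSON line per transcript so a crash or rate-limit stall
    loses at most the transcript currently in flight."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(transcript, ensure_ascii=False) + "\n")
        f.flush()
```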

Retry with Backoff

API failures will happen. Use exponential backoff:

  • Claude: 7 attempts, 1-hour max wait
  • Google: Extract retry delay from error message
  • OpenAI: Standard exponential backoff
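A sketch of the pattern, using the Claude settings above (7 attempts, 1-hour cap); for Google, prefer the retry delay parsed from the error message, and narrow the except clause to your SDK's rate-limit and server errors:

```python
import random
import time

def with_backoff(call, attempts=7, base_delay=1.0, max_delay=3600.0):
    """Exponential backoff sketch: 7 attempts, capped at a 1-hour wait."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:  # narrow this to your SDK's rate-limit / 5xx errors
            if attempt == attempts - 1:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay + random.uniform(0, 1))  # jitter avoids synchronized retries
```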

Progress Tracking

Track throughout generation:

  • Transcripts generated / target
  • Transcripts assessed / generated
  • Pass rate (rolling and cumulative)
  • Criterion failure breakdown
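A sketch of a tracker for these statistics; the result schema (`passed`, `failed_criteria`) is an assumption about your assessor's output. For the rolling pass rate, reuse a fixed-size window like the drift monitor sketched in Step 2.

```python
from collections import Counter

class ProgressTracker:
    """Sketch of the generation statistics to track."""
    def __init__(self, target: int):
        self.target = target
        self.generated = 0
        self.assessed = 0
        self.passed = 0
        self.criterion_failures = Counter()

    def record_generated(self):
        self.generated += 1

    def record_assessment(self, result):
        self.assessed += 1
        self.passed += bool(result["passed"])
        self.criterion_failures.update(result.get("failed_criteria", []))

    def summary(self):
        return {
            "generated/target": f"{self.generated}/{self.target}",
            "assessed/generated": f"{self.assessed}/{self.generated}",
            "cumulative_pass_rate": self.passed / max(self.assessed, 1),
            "top_failing_criteria": self.criterion_failures.most_common(5),
        }
```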

Reference: assessment-guide.md#infrastructure


Resources

| Resource | What It Contains |
| --- | --- |
| code/SETUP-REFERENCE.md | Script templates: generate.py, assess.py, slice.py |
| code/infrastructure.py | Copy-paste ready: LLM backend, retry strategies, checkpointing |
| examples/therapy-domain.md | Complete therapy example: prompts, flaw patterns, criteria |

Done When

  • Target training example count reached
  • Pass rate stable across last 2-3 batches (≥70%)
  • Human satisfied with transcript quality
  • Audit patterns within thresholds
  • training_data.jsonl validated

Next Phase

finetune-train