---
name: fine-tuning-data-generator
description: Generates comprehensive synthetic fine-tuning datasets in ChatML format (JSONL) for use with Unsloth, Axolotl, and similar training frameworks. Gathers requirements, creates datasets with diverse examples, validates quality, and provides framework integration guidance.
version: 2
allowed-tools: Read, Write, Edit, Bash
---
# Fine-Tuning Data Generator

This skill generates high-quality synthetic training data in ChatML format for fine-tuning language models with frameworks such as Unsloth and Axolotl.
## What Do I Need?
| Need | Resource |
|---|---|
| Planning my dataset - requirements, strategy, quality checklist | `resources/dataset-strategy.md` |
| How to create diverse examples - variation techniques, multi-turn patterns, format-specific guidance | `resources/generation-techniques.md` |
| ChatML format details - structure, specification, common issues, framework compatibility | `resources/chatml-format.md` |
| Example datasets - inspiration across domains, multi-turn samples, edge cases | `resources/examples.md` |
| Validating quality - validation workflow, analyzing datasets, troubleshooting | `resources/quality-validation.md` |
| Training & deployment - framework setup, hyperparameters, optimization, deployment | `resources/framework-integration.md` |
## Workflow

### Phase 1: Gather Requirements
Start with these essential clarifying questions:
Task Definition:
- What is the model being trained to do? (e.g., customer support, code generation, creative writing)
- What specific domain or subject matter? (e.g., legal, medical, e-commerce, software development)
- How many training examples are needed? (Recommend: 100+ for simple tasks, 500-1000+ for complex)
Quality & Diversity:
- Complexity range: simple to complex mix, or focus on specific difficulty level?
- Diversity: edge cases, error handling, unusual scenarios?
- Tone/style: professional, friendly, technical, concise, detailed?
- Response length preferences?
- Any specific formats: code blocks, lists, tables, JSON?
Dataset Composition:
- Distribution across subtopics: evenly distributed or weighted?
- Include negative examples (what NOT to do)?
- Need validation split? (Recommend 10-20% of total)
See `resources/dataset-strategy.md` for detailed question templates.
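If you do hold out the 10-20% validation split recommended above, the split itself is easy to script. A minimal sketch of a random 10% hold-out (`training_data_full.jsonl` is a hypothetical combined input file; the output names match the Phase 5 files):

```python
# Minimal sketch: random train/validation split of a ChatML JSONL dataset.
# "training_data_full.jsonl" is a hypothetical combined file; the outputs
# match the file names produced in Phase 5.
import random

random.seed(42)  # reproducible split
with open("training_data_full.jsonl") as f:
    lines = [line for line in f if line.strip()]
random.shuffle(lines)

n_val = max(1, int(0.10 * len(lines)))  # hold out 10% (use 0.20 for 20%)
with open("validation_data.jsonl", "w") as f:
    f.writelines(lines[:n_val])
with open("training_data.jsonl", "w") as f:
    f.writelines(lines[n_val:])
```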
### Phase 2: Create Generation Plan
Present a plan covering:
- Number and distribution of examples across categories
- Key topics/scenarios to cover
- Diversity strategies (phrasing variations, complexity levels, edge cases)
- System prompt approach (consistent vs. varied)
- Quality assurance approach
Get user approval before generating.
### Phase 3: Generate Synthetic Data
Create examples following these quality standards:
Key Principles:
- Realistic scenarios reflecting real-world use cases
- Natural language with varied phrasing and formality levels
- Accurate, helpful responses aligned with desired behavior
- Consistent ChatML formatting throughout
- Balanced difficulty (unless specified)
- Meaningful variety (no repetition)
- Include edge cases and error scenarios
Diversity Techniques:
- Vary query phrasing (questions, commands, statements)
- Include different expertise levels (beginner, intermediate, expert)
- Cover both positive and negative examples
- Mix short and long-form responses
- Include multi-step reasoning when appropriate
- Add context variations
See `resources/generation-techniques.md` for detailed techniques, domain-specific guidance, and batch generation workflow.
### Phase 4: Validate & Document

Run validation tools and checks:

```bash
# Validate JSON formatting and structure
python scripts/validate_chatml.py training_data.jsonl

# Analyze dataset statistics and diversity
python scripts/analyze_dataset.py training_data.jsonl

# Export statistics
python scripts/analyze_dataset.py training_data.jsonl --export stats.json
```
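If the bundled script is unavailable, a quick standalone sanity check can still catch the most common structural errors. A minimal sketch (illustrative only; `scripts/validate_chatml.py` performs more thorough validation):

```python
# Quick standalone ChatML sanity check (a minimal sketch; the bundled
# validate_chatml.py performs more thorough validation).
import json

VALID_ROLES = {"system", "user", "assistant"}

def check_line(line, lineno):
    """Return a list of error strings for one JSONL line."""
    try:
        obj = json.loads(line)
    except json.JSONDecodeError as e:
        return [f"line {lineno}: invalid JSON ({e})"]
    messages = obj.get("messages")
    if not isinstance(messages, list) or not messages:
        return [f"line {lineno}: missing or empty 'messages' array"]
    errors = []
    for i, msg in enumerate(messages):
        if msg.get("role") not in VALID_ROLES:
            errors.append(f"line {lineno}, message {i}: invalid role {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str):
            errors.append(f"line {lineno}, message {i}: 'content' must be a string")
    return errors

with open("training_data.jsonl") as f:
    for n, line in enumerate(f, 1):
        for err in check_line(line, n):
            print(err)
```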
Quality Checklist:
- JSON validation passed (no errors)
- Analysis shows good diversity metrics
- Manual sample review passed
- No duplicate or near-duplicate examples
- All required fields present
- Realistic user queries
- Accurate, helpful responses
- Balanced category distribution
- Dataset metadata documented
See `resources/quality-validation.md` for validation details, troubleshooting, and documentation templates.
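For the duplicate/near-duplicate checklist item, a minimal standalone scan using only the standard library (a sketch: the 0.9 similarity threshold is an arbitrary starting point, and the pairwise comparison is only practical for small datasets):

```python
# Minimal near-duplicate scan over user queries (a sketch; the 0.9
# threshold is an arbitrary starting point, and the O(n^2) comparison
# is only practical for small datasets).
import json
from difflib import SequenceMatcher

with open("training_data.jsonl") as f:
    examples = [json.loads(line) for line in f if line.strip()]

queries = [
    next((m["content"] for m in ex["messages"] if m["role"] == "user"), "")
    for ex in examples
]

for i in range(len(queries)):
    for j in range(i + 1, len(queries)):
        ratio = SequenceMatcher(None, queries[i], queries[j]).ratio()
        if ratio > 0.9:
            print(f"examples {i} and {j} look similar (ratio={ratio:.2f})")
```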
### Phase 5: Integration & Training
Prepare for training with your framework of choice:
Output Files:
- `training_data.jsonl` - Main training set
- `validation_data.jsonl` - Optional validation set
- `dataset_info.txt` - Metadata and statistics
Framework Setup:
- Unsloth: Automatic ChatML detection, efficient 4-bit training
- Axolotl: Specify `type: chat_template` and `chat_template: chatml`
- Hugging Face: Use the tokenizer's `apply_chat_template()` method
- Custom: Load from JSONL, handle ChatML formatting
See `resources/framework-integration.md` for setup code, hyperparameters, deployment options, and best practices.
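For the Hugging Face route, a minimal sketch (the model name is an arbitrary placeholder; any tokenizer that ships a chat template works the same way):

```python
# Minimal sketch: render ChatML JSONL with a Hugging Face chat template.
# The model name below is an arbitrary placeholder.
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

with open("training_data.jsonl") as f:
    for line in f:
        example = json.loads(line)
        # Produces one formatted training string per example
        text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
        print(text[:200])
```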
## ChatML Format Overview

Each training example is a JSON object with a `messages` array:

```json
{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "How do I reverse a string in Python?"}, {"role": "assistant", "content": "Use slicing: `text[::-1]`"}]}
```
Roles:
- `system`: Sets assistant behavior (optional but recommended)
- `user`: The user's input/query
- `assistant`: The model's expected response
Multi-turn: Add additional user/assistant message pairs for conversations.
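For example, a hypothetical two-turn conversation (content is illustrative):

```json
{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What does `len()` do in Python?"}, {"role": "assistant", "content": "It returns the number of items in a container, e.g. `len([1, 2, 3])` is 3."}, {"role": "user", "content": "Does it work on strings?"}, {"role": "assistant", "content": "Yes - for strings it returns the number of characters, so `len(\"abc\")` is 3."}]}
```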
See `resources/chatml-format.md` for detailed specification, validation, common issues, and framework-specific notes.
## Tool Reference

Scripts in `scripts/`:

### validate_chatml.py

Validates ChatML-format JSONL files:

```bash
python scripts/validate_chatml.py training_data.jsonl
python scripts/validate_chatml.py training_data.jsonl --verbose
```
Checks:
- Valid JSON formatting
- Required fields (`messages`, `role`, `content`)
- Valid role values (`system`, `user`, `assistant`)
- Proper message order
- Duplicate detection
- Diversity metrics
### analyze_dataset.py

Provides comprehensive statistics and analysis:

```bash
python scripts/analyze_dataset.py training_data.jsonl
python scripts/analyze_dataset.py training_data.jsonl --export stats.json
```
Provides:
- Dataset overview (total examples, message counts)
- Message length statistics
- System prompt variations
- User query patterns (questions, commands, code-related, length categories)
- Assistant response patterns (code blocks, lists, headers, length categories)
- Quality indicators (diversity score, balance ratio)
- Token estimates and cost projection
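If the bundled analyzer is unavailable, a rough sketch of the kind of statistics it reports (the diversity proxy here is a crude stand-in, not the script's actual metric):

```python
# Rough sketch of dataset statistics in the spirit of analyze_dataset.py
# (illustrative only; the diversity proxy is a crude stand-in, not the
# script's actual metric).
import json
from collections import Counter
from statistics import mean

with open("training_data.jsonl") as f:
    examples = [json.loads(line) for line in f if line.strip()]

user_msgs = [m["content"] for ex in examples for m in ex["messages"] if m["role"] == "user"]
asst_msgs = [m["content"] for ex in examples for m in ex["messages"] if m["role"] == "assistant"]

print(f"examples:              {len(examples)}")
print(f"mean user length:      {mean(len(m) for m in user_msgs):.0f} chars")
print(f"mean assistant length: {mean(len(m) for m in asst_msgs):.0f} chars")

# Crude diversity proxy: share of unique user-query openings (first 5 words)
openings = Counter(" ".join(m.split()[:5]).lower() for m in user_msgs)
print(f"unique query openings: {len(openings) / max(len(user_msgs), 1):.0%}")
```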
## Common Workflows

### Small Dataset (100-200 examples)

1. Gather requirements
2. Create generation plan for 1-2 categories
3. Generate in a single batch, review quality
4. Validate and document
5. Ready for training
### Medium Dataset (500-1000 examples)

1. Gather requirements
2. Create detailed plan with multiple categories
3. Generate in 2-3 batches, reviewing after each
4. Analyze diversity and adjust approach
5. Fill any gaps
6. Final validation and documentation
### Large Dataset (2000+ examples)

1. Gather comprehensive requirements
2. Create multi-batch generation plan
3. Batch 1 (50-100): Foundation examples
4. Batch 2 (100-200): Complexity expansion
5. Batch 3 (100-200): Coverage filling
6. Batch 4 (50-100): Polish and validation
7. Run full validation suite
8. Generate comprehensive documentation
## Best Practices

### Start Small, Iterate
- Generate 10-20 examples first
- Review and get feedback
- Refine approach based on feedback
- Scale up to full dataset
### Quality Over Quantity
- Better to have 500 diverse, high-quality examples than 5,000 repetitive ones
- Each example should teach something new
- Maintain consistent response quality throughout
### Diversify Systematically
- Vary query phrasing (questions, commands, statements)
- Cover different expertise levels
- Mix response complexities
- Include edge cases (typically 20-30% of dataset)
- Use batch generation workflow for large datasets
### Test Before Deployment
- Test dataset with actual training framework
- Monitor training metrics for issues
- Test fine-tuned model outputs before deployment
- Compare results to base model
### Document Everything
- Keep notes on generation parameters
- Save different dataset versions
- Document any modifications made
- Record generation strategies used
- Track model performance metrics
## Advanced Features

### Batch Generation Strategy

For datasets of 500+ examples:
- Generate 50-100 examples at a time
- Review distribution and diversity after each batch
- Adjust generation strategy based on identified gaps
- This prevents repetition and maintains creativity across batches
## Common Pitfalls to Avoid
- Over-templating: Creates repetitive patterns (vary naturally)
- Unrealistic Queries: Overly formal/robotic user inputs (use varied phrasing)
- Narrow Coverage: Limited scenarios and phrasing (ensure diversity)
- Inconsistent Quality: Quality degradation over time (use quality checklist)
- JSON Errors: Invalid formatting breaking training (always validate)
- Missing Context: System prompts without detail (provide clear instructions)
- Response Mismatch: Responses don't address queries (verify relevance)
## Dataset Size Recommendations
| Task Complexity | Recommended Size | Notes |
|---|---|---|
| Simple tasks | 100-500 | Well-defined, limited variation |
| Medium tasks | 500-2,000 | Multiple scenarios, moderate complexity |
| Complex tasks | 2,000-10,000+ | Many edge cases, high variability |
| Domain adaptation | 1,000-5,000 | Specialized knowledge required |
## Resources
- Planning & Strategy: `resources/dataset-strategy.md` - Requirements gathering, planning, quality checklists
- Generation Techniques: `resources/generation-techniques.md` - Diversity techniques, domain-specific guidance, batch workflows
- ChatML Specification: `resources/chatml-format.md` - Format details, validation, framework notes
- Example Datasets: `resources/examples.md` - Diverse domain examples, multi-turn patterns
- Quality Validation: `resources/quality-validation.md` - Validation workflow, analysis, troubleshooting
- Framework Integration: `resources/framework-integration.md` - Setup for Unsloth, Axolotl, HuggingFace; deployment options
Version: 2.0 | Updated: 2024 | Pattern: Modular Orchestration