---
name: dataset-improvement
description: Extract, review, and replace lines in JSONL dataset files for hand-crafted quality improvements. Use when improving synthetic training data, fixing thinking blocks, or manually editing dataset examples.
allowed-tools: Read, Bash, Write
---
# Dataset Improvement Skill

Manually improve the quality of thinking blocks in synthetic training dataset JSONL files through systematic extraction and replacement.
## Instructions
This Skill provides bash scripts to:
- Extract specific lines from JSONL files for manual review
- Replace lines with hand-crafted improvements
- Manage backups automatically
### Workflow
1. Extract lines for review:

   ```bash
   cd .claude/skills/synethetic-data-generation
   ./scripts/improve_dataset.sh <file> <start_line> [count]
   ```

2. Review extracted examples and identify quality issues:
   - Requirements duplicating the plan?
   - Weak or generic memory?
   - Incorrect confidence calibration?
   - Missing contextual details?

3. Hand-craft improvements following the quality guidelines below.

4. Replace the line with the improved version:

   ```bash
   ./scripts/replace_lines.sh <file> <line_num> '<improved_json_content>'
   ```

5. Repeat for additional lines in the batch.
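Under the hood, both operations reduce to standard line-oriented tools. A minimal sketch of the assumed behavior, not the scripts' actual implementation (`dataset.jsonl`, the line numbers, and the replacement payload are placeholders):

```bash
# Extraction: print lines 100-104, numbered from 100 (assumed behavior of improve_dataset.sh)
sed -n '100,104p' dataset.jsonl | nl -ba -v 100

# Replacement: back up, then swap line 100 for new content (assumed behavior of replace_lines.sh).
# Passing the JSON via ENVIRON avoids awk's backslash mangling of -v assignments.
cp dataset.jsonl dataset.jsonl.backup
NEW_LINE='{"conversations":[...]}' awk -v n=100 \
  'NR == n { print ENVIRON["NEW_LINE"]; next } { print }' \
  dataset.jsonl > dataset.jsonl.tmp && mv dataset.jsonl.tmp dataset.jsonl
```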
## Quality Guidelines
**Goal:** Make it specific and actionable
- ❌ "Update content"
- ✅ "Append 'New entry' to Content/log.md to maintain chronological record"
**Memory:** Add rich context (WHY it is happening, WHAT came before, the broader situation)
- ❌ "User updating content"
- ✅ "User maintaining activity log for Content Hub. Previously added 8 entries this week tracking blog post drafts and milestones. Log serves as project timeline for content team."
**Requirements:** Verification checks (DISTINCT from plan)
- Must focus on what to CHECK before acting
- Example:

  ```json
  ["Verify file exists", "Confirm write permissions", "Check file isn't locked"]
  ```
**Plan:** Execution steps (DISTINCT from requirements)
- Must focus on what to DO
- Example:

  ```json
  ["Verify file exists", "Execute append", "Confirm success", "Report to user"]
  ```
**Confidence:** Risk-calibrated scoring
- 0.3-0.5: Risky (delete, replace, overwrite)
- 0.6-0.8: Moderate (update, batch operations)
- 0.85-0.95: Safe (read, list, search, append)
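Putting the guidelines together, a well-calibrated thinking block might look like the sketch below. The field names and structure here are hypothetical, not the dataset's confirmed schema; adapt them to whatever your JSONL lines actually contain. Piping through `jq` doubles as a syntax check:

```bash
# Hypothetical thinking-block shape; confidence 0.9 fits the safe append band above.
jq . <<'EOF'
{
  "goal": "Append 'New entry' to Content/log.md to maintain chronological record",
  "memory": "User maintaining activity log for Content Hub. Previously added 8 entries this week tracking blog post drafts and milestones.",
  "requirements": ["Verify file exists", "Confirm write permissions", "Check file isn't locked"],
  "plan": ["Verify file exists", "Execute append", "Confirm success", "Report to user"],
  "confidence": 0.9
}
EOF
```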
## Examples
### Example 1: Extract 5 lines from contentManager dataset

```bash
./scripts/improve_dataset.sh ../../Datasets/tools_datasets/thinking/contentManager/tools_v1.5.jsonl 100 5
```
Output shows lines 100-104 for review.
### Example 2: Replace line 100 with improved version

```bash
./scripts/replace_lines.sh ../../Datasets/tools_datasets/thinking/contentManager/tools_v1.5.jsonl 100 '{"conversations":[...]}'
```
Creates a backup at `tools_v1.5.jsonl.backup` and replaces line 100.
### Example 3: Process dataset in batches

```bash
# Extract batch of 10
./scripts/improve_dataset.sh dataset.jsonl 1 10

# Review and improve each line manually

# Replace each improved line
./scripts/replace_lines.sh dataset.jsonl 1 '<improved_line_1>'
./scripts/replace_lines.sh dataset.jsonl 2 '<improved_line_2>'
# ... continue for all 10

# Move to next batch
./scripts/improve_dataset.sh dataset.jsonl 11 10
```
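For larger batches, a shell loop can drive the replacements. A minimal sketch, assuming each improved line has been staged in a file named `improved_<N>.json` (a hypothetical convention, not something the scripts require):

```bash
# Replace lines 1-10 from staged files improved_1.json ... improved_10.json
for n in $(seq 1 10); do
  ./scripts/replace_lines.sh dataset.jsonl "$n" "$(cat "improved_${n}.json")"
done
```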
## Best Practices
- **Work in small batches** - Process 10-20 examples at a time
- **Track progress** - Keep notes on which lines are completed
- **Validate JSON** - Use `jq` to verify syntax: `echo '<json>' | jq`
- **Review backups** - Check `.backup` files before continuing
- **Use line numbers** - Files use 1-indexed line numbers
## Safety Features
- Automatic `.backup` files created before replacement
- Line-by-line processing prevents bulk errors
- Manual review required at each step
- JSON validation recommended before replacement
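Because the backup is written before each replacement, recovery is a one-liner. A minimal sketch, using the `.backup` naming shown in Example 2 (`dataset.jsonl` is a placeholder):

```bash
# Inspect what changed, then roll back if needed
diff dataset.jsonl.backup dataset.jsonl
cp dataset.jsonl.backup dataset.jsonl   # restore the pre-replacement state
```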
## Related Tools
- `jq` - Parse and validate JSON
- `diff` - Compare original vs improved
- `wc -l` - Count total lines in dataset
- `sed -n 'Np'` - Preview single line N
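For example, a quick pre-edit check that previews and validates a single line (line 100 and `dataset.jsonl` are placeholders):

```bash
wc -l dataset.jsonl                  # how many examples in total
sed -n '100p' dataset.jsonl | jq .   # pretty-print line 100; fails loudly on invalid JSON
```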