| name | data-designer |
| description | Generate high-quality synthetic datasets using statistical samplers and Claude's native LLM capabilities. Use when users ask to create synthetic data, generate datasets, create fake/mock data, generate test data, training data, or any data generation task. Supports CSV, JSON, JSONL, Parquet output. Adapted from NVIDIA NeMo DataDesigner (Apache 2.0). |
Data Designer
Generate synthetic datasets combining statistical samplers with Claude's LLM capabilities. No external API keys required.
Workflow
- Clarify requirements - Ask about purpose, columns, size, format
- Create schema - Write dataset_schema.json defining columns
- Generate preview - Run batch_generator.py for 3-5 rows
- Iterate - Refine based on feedback
- Generate full dataset - Batch generate, then merge
- Deliver - Export to requested format
Column Types
Statistical Samplers (No LLM)
| Type | Description | Key Params |
|---|---|---|
| category | Weighted random choice | values, weights |
| subcategory | Hierarchical (parent-based) | mapping, category |
| uniform | Uniform distribution | low, high, dtype |
| gaussian | Normal distribution | mean, std, min_val, max_val |
| bernoulli | Binary probability | p, true_value, false_value |
| poisson | Poisson distribution | mean |
| datetime | Random dates | start, end, format |
| person | Synthetic personas | fields, age_range, locale |
| uuid | Unique IDs | prefix, format |
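As a rough illustration of what three of the sampler types above compute, here is a minimal sketch using only Python's standard library. The function names, the TOPICS mapping, and the sample values are hypothetical, and this is not the code in batch_generator.py. Parameter names mirror the table. Clamping gaussian draws to min_val/max_val and picking subcategories uniformly are assumptions about behavior the table does not spell out.

```python
import random

# Hypothetical parent -> children mapping for the subcategory sampler.
TOPICS = {"billing": ["refund", "invoice"], "technical": ["login", "crash"]}

def sample_category(rng, values, weights):
    """Weighted random choice over `values`."""
    return rng.choices(values, weights=weights, k=1)[0]

def sample_subcategory(rng, mapping, parent_value):
    """Pick a child value uniformly from the list mapped to the parent's value."""
    return rng.choice(mapping[parent_value])

def sample_gaussian(rng, mean, std, min_val=None, max_val=None):
    """Normal draw, clamped to the bounds when they are given (assumed behavior)."""
    x = rng.gauss(mean, std)
    if min_val is not None:
        x = max(min_val, x)
    if max_val is not None:
        x = min(max_val, x)
    return x

def make_rows(seed, n):
    """Build n rows from a seeded generator so repeated runs give the same data."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        category = sample_category(rng, ["billing", "technical"], [0.6, 0.4])
        rows.append({
            "category": category,
            "topic": sample_subcategory(rng, TOPICS, category),
            "priority": round(sample_gaussian(rng, 3, 1, min_val=1, max_val=5)),
        })
    return rows

rows = make_rows(seed=42, n=5)
```

Because topic is drawn from the list keyed by the row's category, the two columns stay consistent with each other, which is the point of a parent-based sampler.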
LLM Columns (Claude generates)
| Type | Description |
|---|---|
| llm_text | Free-form text |
| llm_code | Code with syntax validation |
| llm_structured | JSON matching schema |
| llm_judge | Quality scoring |
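The table does not say how llm_code validates its output. One plausible reading, assuming the generated code is Python, is a parse-only check like the sketch below. The helper name is hypothetical; the check only confirms that the text parses and never runs it.

```python
import ast

def is_valid_python(source: str) -> bool:
    """Return True if `source` parses as Python. The code is never executed."""
    try:
        ast.parse(source)
    except (SyntaxError, ValueError):
        return False
    return True

good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b) return a + b"
results = (is_valid_python(good), is_valid_python(bad))
```

A parse check catches malformed output cheaply, but it says nothing about whether the function behaves correctly; that would need tests or an llm_judge column.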
Schema Format
Create dataset_schema.json:
{
  "name": "dataset_name",
  "seed": 42,
  "columns": [
    {"name": "category", "type": "category", "params": {"values": ["A", "B"], "weights": [0.6, 0.4]}},
    {"name": "text", "type": "llm_text", "prompt": "Write about {{ category }}.", "depends_on": ["category"]}
  ],
  "output": {"format": "csv", "filename": "output"}
}
For full schema reference: references/schema.md
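Before generating anything, it can help to check that a schema is internally consistent. The sketch below is a hypothetical helper, not part of the skill's scripts. It loads the example schema from this section and reports duplicate column names and depends_on entries that point at undefined columns.

```python
import json

SCHEMA_TEXT = """
{
  "name": "dataset_name",
  "seed": 42,
  "columns": [
    {"name": "category", "type": "category",
     "params": {"values": ["A", "B"], "weights": [0.6, 0.4]}},
    {"name": "text", "type": "llm_text",
     "prompt": "Write about {{ category }}.", "depends_on": ["category"]}
  ],
  "output": {"format": "csv", "filename": "output"}
}
"""

def check_schema(schema):
    """Return a list of readable problems; an empty list means none were found."""
    problems = []
    names = [col["name"] for col in schema["columns"]]
    for name in sorted(set(names)):
        if names.count(name) > 1:
            problems.append(f"duplicate column name: {name}")
    for col in schema["columns"]:
        for dep in col.get("depends_on", []):
            if dep not in names:
                problems.append(f"{col['name']} depends on unknown column: {dep}")
    return problems

schema = json.loads(SCHEMA_TEXT)
problems = check_schema(schema)
```

Returning a list instead of raising on the first issue lets all problems be reported to the user in one pass.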
Jinja2 Templating
Reference columns in prompts:
Write a {{ rating }}-star review for {{ product_name }} by {{ customer.first_name }}.
Supports: {{ var }}, {{ obj.field }}, {% if %}, filters
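To make the lookup rules concrete, the sketch below is a simplified stand-in for Jinja2, written with only the standard library. It handles {{ var }} and {{ obj.field }} substitution and nothing else: no {% if %} blocks and no filters. With the jinja2 package installed, jinja2.Template(text).render(**row) covers the full syntax. The function name and row values are made up for illustration.

```python
import re

# Matches {{ name }} and {{ name.field }} with optional inner whitespace.
_PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][\w.]*)\s*\}\}")

def render_prompt(template: str, row: dict) -> str:
    """Replace each placeholder with the value looked up in `row`."""
    def lookup(match):
        value = row
        for part in match.group(1).split("."):
            # A KeyError here means the prompt references a missing column.
            value = value[part]
        return str(value)
    return _PLACEHOLDER.sub(lookup, template)

row = {"rating": 4, "product_name": "Trail Mug", "customer": {"first_name": "Ana"}}
prompt = render_prompt(
    "Write a {{ rating }}-star review for {{ product_name }} by {{ customer.first_name }}.",
    row,
)
# prompt == "Write a 4-star review for Trail Mug by Ana."
```

The dotted form works because a person-style column is stored as a nested object, so each row carries its fields under one key.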
Scripts
Generate Data
# Preview
python scripts/batch_generator.py --schema dataset_schema.json --rows 5 --output preview.json --preview
# Full generation
python scripts/batch_generator.py --schema dataset_schema.json --rows 100 --batch-size 20 --output batches/
Merge & Export
python scripts/merger.py --input batches/ --output dataset.csv --flatten
Formats: csv, json, jsonl, parquet
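The merge step combines per-batch records and, with --flatten, presumably turns nested objects into flat columns. The sketch below shows one way that could work; it is not merger.py's actual implementation. The dot separator for nested keys, the function names, and the sample records are all assumptions.

```python
import csv
import io

def flatten(record, parent_key="", sep="."):
    """Turn nested dicts into one level, e.g. customer -> customer.email."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat

def to_csv(records):
    """Flatten records and return CSV text with the union of keys as the header."""
    rows = [flatten(r) for r in records]
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Two hypothetical batches, as they might be read back from batches/.
batches = [
    [{"ticket_id": "t-1", "customer": {"name": "Ana", "email": "ana@example.com"}}],
    [{"ticket_id": "t-2", "customer": {"name": "Ben", "email": "ben@example.com"}}],
]
merged = [record for batch in batches for record in batch]
csv_text = to_csv(merged)
```

Flattening matters mainly for CSV, which has no native nesting; JSON and JSONL output can keep nested objects as they are.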
Generation Strategy
- Sampler columns first - Python scripts, fast
- LLM columns in dependency order - Topological sort by depends_on
- Batch processing - Generate in batches of 20-50 for large datasets
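The dependency ordering above can be sketched with graphlib from the standard library (Python 3.9+). This is one possible approach, not necessarily how the skill's scripts do it; the function name and column definitions are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def generation_order(columns):
    """Order column names so each column comes after everything in its depends_on."""
    graph = {col["name"]: set(col.get("depends_on", [])) for col in columns}
    return list(TopologicalSorter(graph).static_order())

columns = [
    {"name": "review", "type": "llm_text", "depends_on": ["rating", "product_name"]},
    {"name": "score", "type": "llm_judge", "depends_on": ["review"]},
    {"name": "rating", "type": "uniform"},
    {"name": "product_name", "type": "category"},
]
order = generation_order(columns)
```

Here rating and product_name come before review, and review comes before score. If two columns depend on each other, static_order raises graphlib.CycleError, which is a useful early signal that the schema cannot be generated.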
For LLM columns, Claude generates directly:
- Render Jinja2 prompt with row data
- Generate content
- Validate if configured
- Retry on failure (max 3)
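The loop described by these steps can be sketched as follows. In the skill, Claude produces the content directly, so the generate callable here is a hypothetical stand-in, as are the function names. "Max 3" is interpreted as three attempts in total, which is an assumption.

```python
def generate_with_retry(generate, validate, prompt, max_attempts=3):
    """Call generate(prompt) until validate(result) passes or attempts run out."""
    for _ in range(max_attempts):
        result = generate(prompt)
        # A column with no validator configured accepts the first result.
        if validate is None or validate(result):
            return result
    raise ValueError(f"no valid output after {max_attempts} attempts")

# Stand-in generator for demonstration: returns empty text twice, then succeeds.
attempts = []

def fake_generate(prompt):
    attempts.append(prompt)
    return "" if len(attempts) < 3 else f"Response to: {prompt}"

def not_empty(text):
    return bool(text.strip())

text = generate_with_retry(fake_generate, not_empty, "Write about A.")
```

Raising after the last attempt, instead of returning an empty value, keeps silent gaps out of the dataset; a caller could also choose to record the failed row and continue.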
Examples
Simple:
"Generate 50 product reviews with ratings 1-5"
Complex:
"Create 200 support tickets with: ticket_id (UUID), customer (name, email), category (billing/technical/general), priority (1-5 gaussian), description (LLM)"
Code:
"Generate 100 Python functions with description, code (validated), tests"
Tips
- Use seed for reproducibility
- Preview first, then scale
- Keep LLM prompts specific
- Use subcategory for correlated data
Attribution
Adapted from NVIDIA NeMo DataDesigner (Apache 2.0).