| name | data-wrangler |
| description | Transform and export data using DuckDB SQL. Read CSV/Parquet/JSON/Excel/databases, apply SQL transformations (joins, aggregations, PIVOT/UNPIVOT, sampling), and optionally write results to files. Use when the user wants to: (1) Clean, filter, or transform data, (2) Join multiple data sources, (3) Convert between formats (CSV→Parquet, etc.), (4) Create partitioned datasets, (5) Sample large datasets, (6) Export query results. Prefer this over in-context reasoning for datasets with thousands of rows or complex transformations. |
Data Wrangler
Transform and export data using DuckDB SQL.
Contents
- Usage - Command syntax and Windows escaping
- Explore Mode - Quick data profiling
- Query Mode - Return results to Claude
- Write Mode - Export to files
- Request/Response Format - JSON structure
- Source Types - File, database, and cloud sources
- Transformations - SQL patterns reference
- Secrets - Secure credential handling
Usage
IMPORTANT - Windows Shell Escaping:
- Always cd to the skill directory first
- Use double quotes for echo, with escaped inner quotes (\")
- Use forward slashes in file paths
cd "<skill_directory>" && echo "{\"query\": \"SELECT * FROM 'D:/path/to/file.csv'\"}" | uv run scripts/query_duckdb.py
Explore Mode
Get schema, statistics, and a sample in one call. Use this before writing queries to understand the data structure.
{"mode": "explore", "path": "D:/data/sales.csv"}
Response:
{
"file": "D:/data/sales.csv",
"format": "csv",
"row_count": 15234,
"columns": [
{"name": "order_id", "type": "BIGINT", "null_count": 0, "null_percent": 0.0},
{"name": "customer", "type": "VARCHAR", "null_count": 45, "null_percent": 0.3}
],
"sample": "| order_id | customer | ... |\\n|----------|----------|-----|\\n| 1001 | Alice | ... |"
}
Options:
- sample_rows: Number of sample rows (default: 10, max: 100)
- sources: For database tables (same as query mode)
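For example, a larger preview of a file (path is illustrative; this sketch assumes sample_rows is passed at the top level of the explore request):
{"mode": "explore", "path": "D:/data/sales.csv", "sample_rows": 25}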
Query Mode
Return results directly to Claude for analysis.
Direct File Queries
{"query": "SELECT * FROM 'data.csv' LIMIT 10"}
Multi-Source Joins
{
"query": "SELECT s.*, p.category FROM sales s JOIN products p ON s.product_id = p.id",
"sources": [
{"type": "file", "alias": "sales", "path": "/data/sales.parquet"},
{"type": "file", "alias": "products", "path": "/data/products.csv"}
]
}
Write Mode
Export query results to files. Add an output object to write instead of returning data.
Basic Write
{
"query": "SELECT * FROM 'raw.csv' WHERE status = 'active'",
"output": {
"path": "D:/output/filtered.parquet",
"format": "parquet"
}
}
Write with Options
{
"query": "SELECT *, YEAR(date) as year, MONTH(date) as month FROM 'events.csv'",
"output": {
"path": "D:/output/events/",
"format": "parquet",
"options": {
"compression": "zstd",
"partition_by": ["year", "month"],
"overwrite": true
}
}
}
Output Formats
| Format | Options |
|---|---|
| parquet | compression (zstd/snappy/gzip/lz4), partition_by, row_group_size |
| csv | header (default: true), delimiter, compression, partition_by |
| json | array (true = JSON array, false = newline-delimited) |
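For instance, a CSV export combining options from the table above (paths, filter, and delimiter are illustrative):
{
  "query": "SELECT * FROM 'D:/data/sales.parquet' WHERE amount > 0",
  "output": {
    "path": "D:/output/sales_clean.csv",
    "format": "csv",
    "options": {"delimiter": ";", "compression": "gzip", "header": true}
  }
}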
Write Response
Response includes verification info - no need for follow-up queries:
{
"success": true,
"output_path": "D:/output/events/",
"format": "parquet",
"rows_written": 15234,
"files_created": ["D:/output/events/year=2023/data_0.parquet", "..."],
"total_size_bytes": 5678901,
"duration_ms": 1234
}
Overwrite Protection
By default, existing files are not overwritten. Set options.overwrite: true to allow.
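For example, re-running the basic write over an existing file (path is illustrative):
{
  "query": "SELECT * FROM 'raw.csv' WHERE status = 'active'",
  "output": {
    "path": "D:/output/filtered.parquet",
    "format": "parquet",
    "options": {"overwrite": true}
  }
}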
Request/Response Format
Request
{
"query": "SQL statement",
"sources": [...],
"output": {"path": "...", "format": "..."},
"options": {"max_rows": 200, "format": "markdown"},
"secrets_file": "path/to/secrets.yaml"
}
Query Mode Options
- max_rows: Maximum rows to return (default: 200)
- max_bytes: Maximum response size (default: 200000)
- format: markdown (default), json, records, or csv
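For example, an aggregation returned as JSON with a tighter row cap (file and column names are illustrative):
{
  "query": "SELECT customer, SUM(amount) AS total FROM 'D:/data/sales.csv' GROUP BY customer ORDER BY total DESC",
  "options": {"max_rows": 50, "format": "json"}
}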
Query Mode Response (markdown)
| column1 | column2 |
|---|---|
| value1 | value2 |
Query Mode Response (json)
{
"schema": [{"name": "col1", "type": "INTEGER"}],
"rows": [[1, "value"]],
"truncated": false,
"warnings": [],
"error": null
}
Source Types
File (auto-detects CSV, Parquet, JSON, Excel)
{"type": "file", "alias": "data", "path": "/path/to/file.csv"}
Glob patterns: {"path": "/logs/**/*.parquet"}
Custom delimiter: {"path": "/data/file.csv", "delimiter": "|"}
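A glob source can be combined with an aggregation in a single request (paths and column names are illustrative):
{
  "query": "SELECT level, COUNT(*) AS events FROM logs GROUP BY level ORDER BY events DESC",
  "sources": [
    {"type": "file", "alias": "logs", "path": "/logs/**/*.parquet"}
  ]
}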
PostgreSQL
{
"type": "postgres", "alias": "users",
"host": "host", "port": 5432, "database": "db",
"user": "user", "password": "pass",
"schema": "public", "table": "users"
}
MySQL
{
"type": "mysql", "alias": "orders",
"host": "host", "port": 3306, "database": "db",
"user": "user", "password": "pass", "table": "orders"
}
SQLite
{"type": "sqlite", "alias": "data", "path": "/path/to/db.sqlite", "table": "tablename"}
S3
{
"type": "s3", "alias": "logs",
"url": "s3://bucket/path/*.parquet",
"aws_region": "us-east-1",
"aws_access_key_id": "...", "aws_secret_access_key": "..."
}
Transformations
See TRANSFORMS.md for advanced patterns including:
- PIVOT/UNPIVOT - Reshape data between wide and long formats
- Sampling - Random subsets with USING SAMPLE n ROWS or SAMPLE 10%
- Dynamic columns - EXCLUDE, REPLACE, COLUMNS('pattern')
- Window functions - Running totals, rankings, moving averages
- Date/time operations - Extraction, arithmetic, formatting
Quick Examples
-- PIVOT: Convert rows to columns
PIVOT sales ON quarter USING SUM(revenue) GROUP BY region
-- UNPIVOT: Convert columns to rows
UNPIVOT data ON q1, q2, q3, q4 INTO NAME quarter VALUE amount
-- Sampling: Random 10% with reproducible seed
SELECT * FROM large_table USING SAMPLE 10% REPEATABLE(42)
-- Dynamic columns: Exclude sensitive, transform email
SELECT * EXCLUDE (ssn) REPLACE (LOWER(email) AS email) FROM users
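These patterns drop into a request like any other query. For example, materializing the PIVOT result above (paths are illustrative):
{
  "query": "PIVOT sales ON quarter USING SUM(revenue) GROUP BY region",
  "sources": [{"type": "file", "alias": "sales", "path": "D:/data/sales.csv"}],
  "output": {"path": "D:/output/revenue_by_quarter.csv", "format": "csv"}
}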
Workflow
- Inspect schema: DESCRIBE SELECT * FROM 'file.csv'
- Preview data: SELECT * FROM 'file.csv' LIMIT 5
- Transform: Apply filters, joins, aggregations
- Export (optional): Add output to write results
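For example, the first two steps as standalone requests (path is illustrative):
{"query": "DESCRIBE SELECT * FROM 'D:/data/sales.csv'"}
{"query": "SELECT * FROM 'D:/data/sales.csv' LIMIT 5"}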
Error Handling
- If error is non-null: Check column names, verify paths
- If truncated is true: Use more aggregation or filters
- If write fails with "exists": Set options.overwrite: true
Secrets
Store credentials securely in YAML. See SECRETS.md for complete documentation.
{
"query": "SELECT * FROM customers LIMIT 10",
"secrets_file": "D:/path/to/secrets.yaml",
"sources": [{
"type": "postgres", "alias": "customers",
"secret": "my_postgres", "table": "customers"
}]
}
Supported: PostgreSQL, MySQL, S3, GCS, Azure, R2, HTTP, HuggingFace, Iceberg, DuckLake.