| name | data-wrangler |
| description | Transform and export data using DuckDB SQL. Read CSV/Parquet/JSON/Excel/databases, apply SQL transformations (joins, aggregations, PIVOT/UNPIVOT, sampling), and optionally write results to files. Use when the user wants to: (1) Clean, filter, or transform data, (2) Join multiple data sources, (3) Convert between formats (CSV→Parquet, etc.), (4) Create partitioned datasets, (5) Sample large datasets, (6) Export query results. Prefer this over in-context reasoning for datasets with thousands of rows or complex transformations. |
Data Wrangler
Transform and export data using DuckDB SQL.
Contents
- Usage - Command syntax and Windows escaping
- Explore Mode - Quick data profiling
- Query Mode - Return results to Claude
- Write Mode - Export to files
- Request/Response Format - JSON structure
- Source Types - File, database, and cloud sources
- Transformations - SQL patterns reference
- Secrets - Secure credential handling
Usage
IMPORTANT - Windows Shell Escaping:
- Always cd to the skill directory first
- Use double quotes for echo, with escaped inner quotes (\")
- Use forward slashes in file paths
cd "<skill_directory>" && echo "{\"query\": \"SELECT * FROM 'D:/path/to/file.csv'\"}" | uv run scripts/query_duckdb.py
Explore Mode
Get schema, statistics, and a sample in one call. Use this before writing queries to understand the data structure.
{"mode": "explore", "path": "D:/data/sales.csv"}
Response:
{
"file": "D:/data/sales.csv",
"format": "csv",
"row_count": 15234,
"columns": [
{"name": "order_id", "type": "BIGINT", "null_count": 0, "null_percent": 0.0},
{"name": "customer", "type": "VARCHAR", "null_count": 45, "null_percent": 0.3}
],
"sample": "| order_id | customer | ... |\\n|----------|----------|-----|\\n| 1001 | Alice | ... |"
}
Options:
- sample_rows: Number of sample rows (default: 10, max: 100)
- sources: For database tables (same as query mode)
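For example, a larger preview of a file (path is illustrative; this sketch assumes sample_rows is passed at the top level of the explore request):
{"mode": "explore", "path": "D:/data/sales.csv", "sample_rows": 25}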
Query Mode
Return results directly to Claude for analysis.
Direct File Queries
{"query": "SELECT * FROM 'data.csv' LIMIT 10"}
Multi-Source Joins
{
"query": "SELECT s.*, p.category FROM sales s JOIN products p ON s.product_id = p.id",
"sources": [
{"type": "file", "alias": "sales", "path": "/data/sales.parquet"},
{"type": "file", "alias": "products", "path": "/data/products.csv"}
]
}
Write Mode
Export query results to files. Add an output object to write instead of returning data.
Basic Write
{
"query": "SELECT * FROM 'raw.csv' WHERE status = 'active'",
"output": {
"path": "D:/output/filtered.parquet",
"format": "parquet"
}
}
Write with Options
{
"query": "SELECT *, YEAR(date) as year, MONTH(date) as month FROM 'events.csv'",
"output": {
"path": "D:/output/events/",
"format": "parquet",
"options": {
"compression": "zstd",
"partition_by": ["year", "month"],
"overwrite": true
}
}
}
Output Formats
| Format | Options |
|---|---|
| parquet | compression (zstd/snappy/gzip/lz4), partition_by, row_group_size |
| csv | header (default: true), delimiter, compression, partition_by |
| json | array (true = JSON array, false = newline-delimited) |
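For instance, a CSV export combining options from the table above (paths, filter, and delimiter are illustrative):
{
  "query": "SELECT * FROM 'D:/data/sales.parquet' WHERE amount > 0",
  "output": {
    "path": "D:/output/sales_clean.csv",
    "format": "csv",
    "options": {"delimiter": ";", "compression": "gzip", "header": true}
  }
}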
Write Response
Response includes verification info - no need for follow-up queries:
{
"success": true,
"output_path": "D:/output/events/",
"format": "parquet",
"rows_written": 15234,
"files_created": ["D:/output/events/year=2023/data_0.parquet", "..."],
"total_size_bytes": 5678901,
"duration_ms": 1234
}
Overwrite Protection
By default, existing files are not overwritten. Set options.overwrite: true to allow.
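For example, re-running the basic write over an existing file (path is illustrative):
{
  "query": "SELECT * FROM 'raw.csv' WHERE status = 'active'",
  "output": {
    "path": "D:/output/filtered.parquet",
    "format": "parquet",
    "options": {"overwrite": true}
  }
}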
Request/Response Format
Request
{
"query": "SQL statement",
"sources": [...],
"output": {"path": "...", "format": "..."},
"options": {"max_rows": 200, "format": "markdown"},
"secrets_file": "path/to/secrets.yaml"
}
Query Mode Options
- max_rows: Maximum rows to return (default: 200)
- max_bytes: Maximum response size (default: 200000)
- format: markdown (default), json, records, or csv
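For example, an aggregation returned as JSON with a tighter row cap (file and column names are illustrative):
{
  "query": "SELECT customer, SUM(amount) AS total FROM 'D:/data/sales.csv' GROUP BY customer ORDER BY total DESC",
  "options": {"max_rows": 50, "format": "json"}
}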
Query Mode Response (markdown)
| column1 | column2 |
|---|---|
| value1 | value2 |
Query Mode Response (json)
{
"schema": [{"name": "col1", "type": "INTEGER"}],
"rows": [[1, "value"]],
"truncated": false,
"warnings": [],
"error": null
}
Source Types
File (auto-detects CSV, Parquet, JSON, Excel)
{"type": "file", "alias": "data", "path": "/path/to/file.csv"}
Glob patterns: {"path": "/logs/**/*.parquet"}
Custom delimiter: {"path": "/data/file.csv", "delimiter": "|"}
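A glob source can be combined with an aggregation in a single request (paths and column names are illustrative):
{
  "query": "SELECT level, COUNT(*) AS events FROM logs GROUP BY level ORDER BY events DESC",
  "sources": [
    {"type": "file", "alias": "logs", "path": "/logs/**/*.parquet"}
  ]
}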
PostgreSQL
{
"type": "postgres", "alias": "users",
"host": "host", "port": 5432, "database": "db",
"user": "user", "password": "pass",
"schema": "public", "table": "users"
}
MySQL
{
"type": "mysql", "alias": "orders",
"host": "host", "port": 3306, "database": "db",
"user": "user", "password": "pass", "table": "orders"
}
SQLite
{"type": "sqlite", "alias": "data", "path": "/path/to/db.sqlite", "table": "tablename"}
S3
{
"type": "s3", "alias": "logs",
"url": "s3://bucket/path/*.parquet",
"aws_region": "us-east-1",
"aws_access_key_id": "...", "aws_secret_access_key": "..."
}
Transformations
See TRANSFORMS.md for advanced patterns including:
- PIVOT/UNPIVOT - Reshape data between wide and long formats
- Sampling - Random subsets with USING SAMPLE n ROWS or SAMPLE 10%
- Dynamic columns - EXCLUDE, REPLACE, COLUMNS('pattern')
- Window functions - Running totals, rankings, moving averages
- Date/time operations - Extraction, arithmetic, formatting
Quick Examples
-- PIVOT: Convert rows to columns
PIVOT sales ON quarter USING SUM(revenue) GROUP BY region
-- UNPIVOT: Convert columns to rows
UNPIVOT data ON q1, q2, q3, q4 INTO NAME quarter VALUE amount
-- Sampling: Random 10% with reproducible seed
SELECT * FROM large_table USING SAMPLE 10% REPEATABLE(42)
-- Dynamic columns: Exclude sensitive, transform email
SELECT * EXCLUDE (ssn) REPLACE (LOWER(email) AS email) FROM users
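These patterns drop into a request like any other query. For example, materializing the PIVOT result above (paths are illustrative):
{
  "query": "PIVOT sales ON quarter USING SUM(revenue) GROUP BY region",
  "sources": [{"type": "file", "alias": "sales", "path": "D:/data/sales.csv"}],
  "output": {"path": "D:/output/revenue_by_quarter.csv", "format": "csv"}
}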
Workflow
- Inspect schema: DESCRIBE SELECT * FROM 'file.csv'
- Preview data: SELECT * FROM 'file.csv' LIMIT 5
- Transform: Apply filters, joins, aggregations
- Export (optional): Add output to write results
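For example, the first two steps as standalone requests (path is illustrative):
{"query": "DESCRIBE SELECT * FROM 'D:/data/sales.csv'"}
{"query": "SELECT * FROM 'D:/data/sales.csv' LIMIT 5"}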
Error Handling
- If error is non-null: Check column names, verify paths
- If truncated is true: Use more aggregation or filters
- If write fails with "exists": Set options.overwrite: true
Secrets
Store credentials securely in YAML. See SECRETS.md for complete documentation.
{
"query": "SELECT * FROM customers LIMIT 10",
"secrets_file": "D:/path/to/secrets.yaml",
"sources": [{
"type": "postgres", "alias": "customers",
"secret": "my_postgres", "table": "customers"
}]
}
Supported: PostgreSQL, MySQL, S3, GCS, Azure, R2, HTTP, HuggingFace, Iceberg, DuckLake.