| name | duckdb-remote-parquet-query |
| description | Query remote Parquet files via HTTP without downloading using DuckDB httpfs. Leverage column pruning, row filtering, and range requests for efficient bandwidth usage. Use for crypto/trading data distribution and analytics. |
DuckDB Remote Parquet Query
Overview
This skill teaches how to query remote Parquet files over HTTP/HTTPS without downloading the entire file, using DuckDB's httpfs extension. This pattern is essential for distributing large datasets (crypto OHLCV data, trade history, orderbook snapshots) while allowing users to run filtered queries with minimal bandwidth.
Core Pattern: DuckDB httpfs extension → HTTP range requests → Column/row pruning → Query only needed data
Use Cases:
- Query a 20 MB Parquet database for one symbol's history (downloads ~60 KB instead of the full 20 MB)
- Aggregate daily statistics without downloading raw tick data
- Explore schema and sample data before committing to full download
- Build APIs that serve filtered subsets from static Parquet files
When to Use This Skill
Use this skill when you need to:
- Distribute Data - Serve Parquet files to users without building a backend API
- Query Large Files - Access specific rows/columns from multi-GB Parquet files
- Minimize Bandwidth - Users on slow networks or metered connections
- Optimize Documentation - Show users how to query your published datasets efficiently
- Prototype APIs - Test query patterns before building dedicated infrastructure
Common Questions:
- "How do I let users query my Parquet database without downloading it?"
- "What's the most efficient way to serve historical OHLCV data?"
- "Can DuckDB query GitHub Releases directly?"
- "How do I reduce query latency for remote Parquet files?"
Quick Start
Prerequisites
# Install DuckDB (Python)
pip install duckdb>=1.0.0
# Or install DuckDB CLI
brew install duckdb # macOS
Minimal Example
import duckdb
# 1. Setup connection with httpfs extension
conn = duckdb.connect(":memory:")
conn.execute("INSTALL httpfs")
conn.execute("LOAD httpfs")
# 2. Query remote Parquet file (replace URL)
url = "https://cdn.jsdelivr.net/gh/org/repo@tag/data.parquet"
result = conn.execute(f"""
SELECT date, symbol, close_price
FROM read_parquet('{url}')
WHERE symbol = 'BTCUSDT'
AND date >= '2024-01-01'
LIMIT 10
""").fetchall()
# Network transfer: ~60 KB (not full 20 MB file!)
for row in result:
print(row)
Key Points:
- No full download required (uses HTTP range requests)
- Column pruning: Only reads the `date`, `symbol`, `close_price` columns
- Row filtering: WHERE clause skips irrelevant row groups
- Works with any HTTP/HTTPS URL serving Parquet files
Core Concepts
HTTP Range Requests
DuckDB's httpfs extension uses HTTP `Range: bytes=X-Y` headers to read only the required portions of a Parquet file:
- Metadata read - Fetch file footer (< 10 KB) to understand schema and row groups
- Column chunks - Download only requested columns from relevant row groups
- Row groups - Skip row groups outside WHERE filter range
Example:
- Parquet file: 20 MB, 1M rows, 20 columns
- Query: `SELECT date FROM table WHERE symbol = 'BTCUSDT'` (1 of 327 symbols)
- Network transfer: ~60 KB (0.3% of file size)
Requirements:
- Server must return the `Accept-Ranges: bytes` header
- Works with: S3, GCS, Azure Blob, CDNs (jsDelivr, Cloudflare)
- Doesn't work: GitHub Releases direct URLs (use jsDelivr proxy)
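To confirm range support programmatically before pointing DuckDB at a URL, a HEAD request (or a one-byte ranged GET) is enough. A minimal sketch, assuming the `requests` package is available; the URL is a placeholder:

```python
import requests

def supports_range_requests(url: str) -> bool:
    """Return True if the server will honor HTTP Range requests for this URL."""
    # HEAD request: check the Accept-Ranges header without downloading the body.
    head = requests.head(url, allow_redirects=True, timeout=10)
    if head.headers.get("Accept-Ranges", "").lower() == "bytes":
        return True
    # Some servers omit Accept-Ranges but still honor Range; probe with a 1-byte read.
    probe = requests.get(url, headers={"Range": "bytes=0-0"}, timeout=10)
    return probe.status_code == 206  # 206 Partial Content => ranges supported

# Placeholder URL -- replace with your own Parquet file
print(supports_range_requests("https://cdn.jsdelivr.net/gh/org/repo@tag/data.parquet"))
```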
Column Pruning
Only downloads columns referenced in SELECT/WHERE/ORDER BY:
-- Bad: Downloads all 20 columns
SELECT * FROM read_parquet('https://example.com/data.parquet')
-- Good: Downloads only 3 columns (85% bandwidth reduction)
SELECT date, symbol, volume FROM read_parquet('https://example.com/data.parquet')
Predicate Pushdown
DuckDB pushes WHERE filters to Parquet reader, skipping entire row groups:
-- Skips row groups where date < '2024-01-01'
SELECT * FROM read_parquet('https://example.com/data.parquet')
WHERE date >= '2024-01-01' -- Filter pushed to Parquet reader
Best filters (highest selectivity):
- Date ranges (if Parquet sorted by date)
- Equality: `symbol = 'BTCUSDT'`
- IN clauses: `symbol IN ('BTCUSDT', 'ETHUSDT')`
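To verify that a filter is actually pushed into the Parquet reader, inspect the query plan. A sketch using the same placeholder URL; the exact plan text varies between DuckDB versions, but a pushed-down predicate appears inside the Parquet scan operator rather than as a separate filter node:

```python
import duckdb

conn = duckdb.connect(":memory:")
conn.execute("INSTALL httpfs; LOAD httpfs")

url = "https://cdn.jsdelivr.net/gh/org/repo@tag/data.parquet"  # placeholder

# EXPLAIN returns the plan as rows of (key, text); print the plan text.
for _, plan_text in conn.execute(f"""
    EXPLAIN
    SELECT date, close_price
    FROM read_parquet('{url}')
    WHERE symbol = 'BTCUSDT'
""").fetchall():
    print(plan_text)
```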
Using Bundled Resources
scripts/remote_query_example.py
Comprehensive working examples demonstrating:
- Basic queries - Count rows, schema inspection
- Column pruning - SELECT specific columns to reduce bandwidth
- Row filtering - WHERE clauses with predicate pushdown
- Aggregations - Remote grouping to minimize network transfer
- Performance comparison - Measure speedup from optimization
- Local caching - Download filtered subset once, query many times
- Export - Save filtered data to local Parquet/CSV
Usage:
# View examples
cat skills/duckdb-remote-parquet-query/scripts/remote_query_example.py
# Run examples (replace URL first!)
python skills/duckdb-remote-parquet-query/scripts/remote_query_example.py
Copy-paste friendly: Each function is self-contained and can be adapted to your use case.
references/performance-optimization.md
In-depth guide covering:
- Optimization patterns - Column pruning, predicate pushdown, aggregation before export
- Performance benchmarks - Bandwidth/latency for different query types
- Troubleshooting - Diagnose slow queries, high bandwidth usage
- CDN proxy setup - Use jsDelivr for GitHub Releases (no native range support)
- Configuration tuning - HTTP timeout, retries, memory limits
Usage: Reference when optimizing query performance or troubleshooting issues.
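For reference, the configuration tuning mentioned above boils down to a handful of `SET` statements. A minimal sketch with hypothetical values; the option names exist in recent DuckDB releases, but verify names and units against your version's documentation:

```python
import duckdb

conn = duckdb.connect(":memory:")
conn.execute("INSTALL httpfs; LOAD httpfs")

# Hypothetical tuning values -- adjust for your network and machine.
conn.execute("SET http_timeout = 60000")      # HTTP timeout (milliseconds in recent releases)
conn.execute("SET http_retries = 5")          # retry transient HTTP failures
conn.execute("SET http_retry_wait_ms = 200")  # wait between retries
conn.execute("SET memory_limit = '4GB'")      # cap DuckDB's memory usage
```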
Workflow
Step 1: Verify Server Supports Range Requests
# Check HTTP headers
curl -I https://example.com/data.parquet
# Look for:
# HTTP/1.1 200 OK
# Accept-Ranges: bytes ← Required for efficient queries
If missing: Use CDN proxy (jsDelivr, Cloudflare) or download full file.
Step 2: Install and Load httpfs Extension
import duckdb
conn = duckdb.connect(":memory:") # Or persistent: "local.duckdb"
conn.execute("INSTALL httpfs") # Download extension (once)
conn.execute("LOAD httpfs") # Load into connection
Step 3: Inspect Schema (Optional but Recommended)
url = "https://cdn.jsdelivr.net/gh/org/repo@tag/data.parquet"
# Check available columns and types
schema = conn.execute(f"""
DESCRIBE SELECT * FROM read_parquet('{url}')
""").fetchall()
for col in schema:
print(f"{col[0]:<20} {col[1]:<15}") # column_name, type
Why: Avoid trial-and-error queries and plan the optimal SELECT statement up front.
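Beyond `DESCRIBE`, DuckDB's `parquet_metadata()` table function reads only the file footer and reports per-row-group statistics, which helps judge how much a filter can skip. A sketch reusing `conn` and `url` from the steps above; the exact metadata columns can differ slightly between DuckDB versions:

```python
# Per-row-group statistics fetched from the Parquet footer (a few KB over HTTP).
# min/max stats show which row groups a WHERE filter on date/symbol can skip.
row_groups = conn.execute(f"""
    SELECT row_group_id, path_in_schema, stats_min, stats_max, total_compressed_size
    FROM parquet_metadata('{url}')
    WHERE path_in_schema IN ('date', 'symbol')
    ORDER BY row_group_id
    LIMIT 20
""").fetchall()

for rg in row_groups:
    print(rg)
```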
Step 4: Write Optimized Query
Checklist:
- ✅ SELECT only needed columns (not `SELECT *`)
- ✅ WHERE filter on high-selectivity columns (date, symbol)
- ✅ Aggregate remotely if possible (GROUP BY reduces network transfer)
- ✅ LIMIT with WHERE filter (avoid ORDER BY without filter)
Example:
result = conn.execute(f"""
SELECT date, symbol, volume, close_price
FROM read_parquet('{url}')
WHERE symbol = 'BTCUSDT'
AND date >= '2024-01-01'
ORDER BY date DESC
LIMIT 100
""").fetchall()
Step 5: Cache Locally if Repeated Queries
# Download filtered subset once
conn.execute(f"""
CREATE TABLE local_cache AS
SELECT *
FROM read_parquet('{url}')
WHERE date >= '2024-01-01'
AND symbol IN ('BTCUSDT', 'ETHUSDT', 'SOLUSDT')
""")
# Run many queries on local table (fast, no network)
counts = conn.execute("SELECT symbol, COUNT(*) FROM local_cache GROUP BY symbol").fetchall()
trends = conn.execute("SELECT date, AVG(volume) FROM local_cache GROUP BY date").fetchall()
When to cache:
- Running >3 queries on same data
- Interactive exploration (Jupyter notebooks)
- Dashboards that refresh frequently
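If the cache should survive across sessions (notebooks reopened daily, scheduled jobs), write it into a persistent DuckDB file rather than `:memory:`. A sketch under that assumption; `local_cache.duckdb` and the URL are placeholders:

```python
import duckdb

# A persistent database file keeps the cached table across sessions.
conn = duckdb.connect("local_cache.duckdb")  # placeholder filename
conn.execute("INSTALL httpfs; LOAD httpfs")

url = "https://cdn.jsdelivr.net/gh/org/repo@tag/data.parquet"  # placeholder

# Create the cache only on the first run; later runs reuse the local table.
conn.execute(f"""
    CREATE TABLE IF NOT EXISTS local_cache AS
    SELECT *
    FROM read_parquet('{url}')
    WHERE date >= '2024-01-01'
      AND symbol IN ('BTCUSDT', 'ETHUSDT', 'SOLUSDT')
""")

print(conn.execute("SELECT COUNT(*) FROM local_cache").fetchone())
```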
Step 6: Export Results
# Export to Parquet
conn.execute("""
COPY (SELECT * FROM local_cache)
TO '/tmp/filtered_data.parquet' (FORMAT PARQUET)
""")
# Export to CSV
conn.execute("""
COPY (SELECT * FROM local_cache)
TO '/tmp/filtered_data.csv' (HEADER, DELIMITER ',')
""")
Domain Context: Crypto/Trading Data
This skill is optimized for:
Data Types:
- OHLCV klines (1m, 5m, 1h, 1d bars)
- Trade ticks (timestamp, price, quantity, side)
- Orderbook snapshots (bid/ask levels, depths)
- Funding rates, liquidations, open interest
- Availability/coverage metadata (which symbols have data for which dates)
Common Patterns:
- Symbol filtering: `WHERE symbol IN (...)` (1 of 300+ symbols → 99% of data skipped)
- Date ranges: `WHERE date BETWEEN '2024-01-01' AND '2024-12-31'`
- Volume thresholds: `WHERE volume > 1000000` (filter low-liquidity periods)
- Aggregations: Daily/hourly rollups from minute data (see the sketch below)
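As an illustration of the aggregation pattern, a daily OHLCV rollup computed remotely so that only the aggregated rows cross the network. This is a sketch: the column names (`open_time`, `open_price`, `high_price`, `low_price`, `close_price`, `volume`) are assumptions and should be checked against the real schema with `DESCRIBE` first:

```python
import duckdb

conn = duckdb.connect(":memory:")
conn.execute("INSTALL httpfs; LOAD httpfs")
url = "https://cdn.jsdelivr.net/gh/org/repo@tag/data.parquet"  # placeholder

# Daily rollup computed remotely -- only ~365 aggregated rows per symbol-year
# cross the network. Column names below are assumptions; verify with DESCRIBE.
daily = conn.execute(f"""
    SELECT
        CAST(open_time AS DATE)               AS day,
        first(open_price ORDER BY open_time)  AS open,
        max(high_price)                       AS high,
        min(low_price)                        AS low,
        last(close_price ORDER BY open_time)  AS close,
        sum(volume)                           AS volume
    FROM read_parquet('{url}')
    WHERE symbol = 'BTCUSDT'
      AND open_time >= TIMESTAMP '2024-01-01'
    GROUP BY day
    ORDER BY day
""").fetchall()
```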
Performance Examples (20 MB Parquet, 327 symbols, 2019-2025 history):
- Full scan: 20 MB download, 3.0s
- Single symbol: 60 KB download, 0.3s (10x faster)
- Daily aggregation: 10 KB download, 0.2s (15x faster)
Tips for Success
- Always SELECT specific columns - `SELECT *` defeats the purpose of remote queries
- Test filters for selectivity - `symbol = 'BTCUSDT'` (high) vs `volume > 0` (low)
- Aggregate remotely - `GROUP BY date` reduces network transfer 100-1000x
- Use a CDN proxy for GitHub Releases - jsDelivr adds range request support
- Cache locally for exploration - Download once, query many times
- Check schema first - `DESCRIBE SELECT *` shows available columns without transferring data
- Monitor bandwidth - Compare query times to verify the optimization is working
Common Pitfalls to Avoid
- **Using `SELECT *`** - Downloads all columns, wastes bandwidth
- ORDER BY without WHERE - Sorts entire file (downloads everything)
- Repeated identical queries - Cache locally after first query
- Low-selectivity filters - `WHERE volume > 0` doesn't skip row groups
- GitHub Releases direct URLs - Don't support range requests; use the jsDelivr proxy
- HTTP instead of HTTPS - Some CDNs require HTTPS for range support
- No error handling - Network failures should raise, not return empty results
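For the last pitfall, a minimal error-handling sketch: catch `duckdb.Error` (the package's DB-API base exception) and re-raise with context instead of swallowing the failure:

```python
import duckdb

def query_remote(url: str, sql: str):
    """Run a query against a remote Parquet file; surface network failures."""
    conn = duckdb.connect(":memory:")
    conn.execute("INSTALL httpfs; LOAD httpfs")
    try:
        return conn.execute(sql).fetchall()
    except duckdb.Error as exc:  # DB-API base exception; covers HTTP/IO errors
        # Re-raise with context rather than returning an empty result.
        raise RuntimeError(f"Remote query failed for {url}: {exc}") from exc

url = "https://cdn.jsdelivr.net/gh/org/repo@tag/data.parquet"  # placeholder
rows = query_remote(url, f"SELECT COUNT(*) FROM read_parquet('{url}')")
```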
Troubleshooting
Query downloads full file (not using range requests)
Diagnose:
curl -I https://your-url/data.parquet | grep "Accept-Ranges"
Solutions:
- If no `Accept-Ranges: bytes`: Use a CDN proxy (jsDelivr)
- If HTTP: Switch to HTTPS
- If GitHub Releases: Use `https://cdn.jsdelivr.net/gh/org/repo@tag/file.parquet`
Query is slow (>5 seconds for small result)
Diagnose:
EXPLAIN SELECT ... FROM read_parquet('url') WHERE ...
Solutions:
- Add explicit column list (not `SELECT *`)
- Add WHERE filter on date/symbol
- Check network speed: `curl -w "@curl-format.txt" URL`
DuckDB can't find httpfs extension
Solution:
# Install extension manually
conn.execute("INSTALL httpfs FROM 'https://extensions.duckdb.org'")
conn.execute("LOAD httpfs")
Memory error on large queries
Solution:
SET memory_limit = '4GB'; -- Increase limit (default: 80% of RAM)
Real-World Example: Binance Futures Availability
import duckdb
# Setup
conn = duckdb.connect(":memory:")
conn.execute("INSTALL httpfs; LOAD httpfs")
# Query availability database (327 symbols, 2019-2025, 20 MB file)
url = "https://cdn.jsdelivr.net/gh/org/binance-futures@v1.0.0/availability.parquet"
# Get BTCUSDT availability for 2024 (downloads ~60 KB vs 20 MB)
result = conn.execute(f"""
SELECT date, is_available, file_size_bytes
FROM read_parquet('{url}')
WHERE symbol = 'BTCUSDT'
AND date >= '2024-01-01'
ORDER BY date
""").fetchall()
# Calculate uptime percentage
total_days = len(result)
available_days = sum(1 for row in result if row[1])
uptime_pct = 100.0 * available_days / total_days
print(f"BTCUSDT availability in 2024: {uptime_pct:.2f}% ({available_days}/{total_days} days)")
Performance:
- Network transfer: ~60 KB (0.3% of file)
- Query time: 0.3s (cold), 0.1s (warm)
- Bandwidth savings: 99.7%