
historical-backfill-execution

@terrylica/gapless-network-data

Execute chunked historical blockchain data backfills using the canonical 1-year pattern. Use when loading multi-year historical data, filling gaps in ClickHouse, or preventing OOM failures on Cloud Run. Keywords: chunked_backfill.sh, BigQuery historical, gap filling, memory-safe backfill.

Install Skill

1. Download the skill

2. Enable skills in Claude: open claude.ai/settings/capabilities and find the "Skills" section

3. Upload to Claude: click "Upload skill" and select the downloaded ZIP file

Note: Please verify the skill by reading through its instructions before using it.

SKILL.md

name: historical-backfill-execution
description: Execute chunked historical blockchain data backfills using the canonical 1-year pattern. Use when loading multi-year historical data, filling gaps in ClickHouse, or preventing OOM failures on Cloud Run. Keywords: chunked_backfill.sh, BigQuery historical, gap filling, memory-safe backfill.

Historical Backfill Execution

Version: 2.0.0
Last Updated: 2025-11-29
Purpose: Execute chunked historical blockchain data backfills using the canonical 1-year pattern

When to Use

Use this skill when:

  • Loading multi-year historical data (e.g., 2015-2025 Ethereum blocks, 23.8M blocks total)
  • Gaps detected in ClickHouse requiring backfill
  • Preventing OOM (Out of Memory) failures on Cloud Run (4GB memory limit)
  • You need to execute a complete historical data collection run
  • Keywords: chunked_backfill.sh, BigQuery historical, gap filling, memory-safe backfill

Canonical Pattern (Established 2025-11-10)

Empirically Validated Approach:

  • Chunk size: 1 year (~2.6M blocks for Ethereum)
  • Memory usage: <4GB per chunk (Cloud Run safe, no OOM errors)
  • Execution time: ~1m40s-2m per chunk (total ~20 minutes for 10 years)
  • Idempotency: ReplacingMergeTree allows safe re-runs (automatic deduplication; see the sketch below)
  • Cost: $0/month (within BigQuery 1TB/month free tier, ~10MB per query)
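
The idempotency bullet above follows from the table engine. A minimal sketch of a ReplacingMergeTree definition via clickhouse-connect (illustrative only; the table name blocks_example is hypothetical, and the production ethereum_mainnet.blocks schema is not reproduced here):

# Illustrative only: a minimal ReplacingMergeTree table, not the production schema.
# The engine collapses rows that share the ORDER BY key (here: number) at merge time
# or when queried with FINAL, so re-running a chunk does not leave duplicates.
import os
import clickhouse_connect

client = clickhouse_connect.get_client(
    host=os.environ['CLICKHOUSE_HOST'],
    port=8443,
    username='default',
    password=os.environ['CLICKHOUSE_PASSWORD'],
    secure=True
)
client.command('''
CREATE TABLE IF NOT EXISTS ethereum_mainnet.blocks_example
(
    number    UInt64,
    hash      String,
    timestamp DateTime
)
ENGINE = ReplacingMergeTree
ORDER BY number
''')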

Why 1-Year Chunks?

See Backfill Patterns Reference for complete rationale (memory constraints, retry granularity, progress tracking).

Prerequisites

  • GCP project access: eonlabs-ethereum-bq
  • BigQuery dataset: bigquery-public-data.crypto_ethereum.blocks
  • ClickHouse Cloud database: ethereum_mainnet.blocks
  • ClickHouse credentials configured (via Doppler or environment variable)
  • Python dependencies: google-cloud-bigquery, pyarrow, clickhouse-connect (a preflight check is sketched below)
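
A quick preflight sketch for the Python-side prerequisites (assumes the same CLICKHOUSE_HOST and CLICKHOUSE_PASSWORD environment variables used in the verification step later in this document):

# Preflight check: confirm the dependencies import and ClickHouse credentials are set.
import importlib
import os

for module in ('google.cloud.bigquery', 'pyarrow', 'clickhouse_connect'):
    importlib.import_module(module)  # raises ImportError if the package is missing
    print(f'OK: {module}')

for var in ('CLICKHOUSE_HOST', 'CLICKHOUSE_PASSWORD'):
    assert os.environ.get(var), f'Missing environment variable: {var}'
    print(f'OK: {var} is set')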

Workflow

1. Execute Chunked Backfill

Run the canonical 1-year chunking script:

cd /deployment/backfill
./chunked_backfill.sh 2015 2025

What This Does (a simplified Python sketch follows the list):

  • Loads blocks year-by-year from BigQuery
  • Uses PyArrow for efficient data transfer
  • Inserts into ClickHouse with ReplacingMergeTree (automatic deduplication)
  • Shows progress for each year: "Loading blocks 1 - 2,600,000"
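
The steps above amount to a simple per-year loop. A simplified Python sketch (not the actual chunked_backfill.sh implementation; the column list is abbreviated and the timestamp-based chunking is an assumption):

# Sketch of the per-year chunk loop: BigQuery -> PyArrow -> ClickHouse.
import os
import clickhouse_connect
from google.cloud import bigquery

bq = bigquery.Client(project='eonlabs-ethereum-bq')
ch = clickhouse_connect.get_client(
    host=os.environ['CLICKHOUSE_HOST'],
    port=8443,
    username='default',
    password=os.environ['CLICKHOUSE_PASSWORD'],
    secure=True
)

for year in range(2015, 2026):
    # Pull one calendar year of blocks; only a few columns are shown for illustration.
    sql = f"""
        SELECT number, `hash`, timestamp
        FROM `bigquery-public-data.crypto_ethereum.blocks`
        WHERE timestamp >= TIMESTAMP '{year}-01-01'
          AND timestamp <  TIMESTAMP '{year + 1}-01-01'
    """
    arrow_table = bq.query(sql).to_arrow()  # stream results into a PyArrow table
    # ReplacingMergeTree deduplicates on the sorting key, so re-running a year is safe.
    ch.insert_arrow('ethereum_mainnet.blocks', arrow_table)
    print(f'Year {year}: inserted {arrow_table.num_rows:,} blocks')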

Alternative (use the provided validation wrapper):

.claude/skills/historical-backfill-execution/scripts/chunked_executor.sh 2015 2025

2. Monitor Progress

Watch the output for year-by-year progress:

Loading blocks for year 2015...
Year 2015: Loading blocks 1 - 2,600,000
Completed 2015 in 1m42s

Loading blocks for year 2016...
Year 2016: Loading blocks 2,600,001 - 5,200,000
Completed 2016 in 1m38s

...

All years completed successfully!
Total blocks loaded: 23,800,000
Total time: 18m45s

3. Verify Completeness

After backfill completes, verify all blocks loaded:

cd 
doppler run --project aws-credentials --config prd -- python3 -c "
import clickhouse_connect
import os
client = clickhouse_connect.get_client(
    host=os.environ['CLICKHOUSE_HOST'],
    port=8443,
    username='default',
    password=os.environ['CLICKHOUSE_PASSWORD'],
    secure=True
)
result = client.query('SELECT COUNT(*) as total, MIN(number) as min_block, MAX(number) as max_block FROM ethereum_mainnet.blocks FINAL')
print(f'Total blocks: {result.result_rows[0][0]:,}')
print(f'Block range: {result.result_rows[0][1]:,} to {result.result_rows[0][2]:,}')
"

Expected Output (healthy):

Total blocks: 23,800,000+
Block range: 1 to 23,800,000+

4. Detect Gaps (If Needed)

Run gap detection to ensure zero missing blocks. The gap monitor Cloud Function runs every 3 hours automatically. For manual checks:

# Trigger manual gap check via Cloud Scheduler
gcloud scheduler jobs run motherduck-monitor-trigger --location=us-east1

# View logs
gcloud functions logs read motherduck-gap-detector --region=us-east1 --gen2 --limit=50
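
For an ad-hoc check directly against ClickHouse, a gap-query sketch (the window-function part assumes a ClickHouse version with window support; the count-versus-range check works everywhere):

# Ad-hoc gap check: compare row count to the block-number span, then list missing ranges.
import os
import clickhouse_connect

client = clickhouse_connect.get_client(
    host=os.environ['CLICKHOUSE_HOST'],
    port=8443,
    username='default',
    password=os.environ['CLICKHOUSE_PASSWORD'],
    secure=True
)

# If count() equals max - min + 1, there are no gaps.
cnt, lo, hi = client.query(
    'SELECT count(), min(number), max(number) FROM ethereum_mainnet.blocks FINAL'
).result_rows[0]
print(f'blocks={cnt:,} expected={hi - lo + 1:,} gaps={"none" if cnt == hi - lo + 1 else "present"}')

# List missing block ranges (lagInFrame returns 0 for the first row, which is harmless here).
gaps = client.query('''
SELECT prev + 1 AS gap_start, number - 1 AS gap_end
FROM (
    SELECT number,
           lagInFrame(number) OVER (ORDER BY number ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS prev
    FROM ethereum_mainnet.blocks FINAL
)
WHERE number - prev > 1
''').result_rows
for gap_start, gap_end in gaps:
    print(f'Missing blocks {gap_start:,} to {gap_end:,}')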

5. Handle Specific Year Range (Partial Backfill)

Load specific years only:

# Load only 2023-2024
cd deployment/backfill
./chunked_backfill.sh 2023 2024

Use Cases:

  • Filling detected gaps in specific years
  • Testing backfill process on small range
  • Recovering from failed backfill (resume from specific year)

Memory Management

Cloud Run Limits

Constraint: 4GB memory limit per Cloud Run Job execution

Solution: 1-year chunks keep memory usage <4GB per execution

What Happens on OOM:

  • Cloud Run Job exits with code 137 (SIGKILL)
  • Error: "Memory limit exceeded"
  • Solution: Reduce chunk size or increase the Cloud Run memory allocation (see the splitting sketch below)
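
The "reduce chunk size" mitigation amounts to splitting a year into smaller date ranges and backfilling them separately, matching the "6 months instead of 1 year" advice in the troubleshooting table below. A rough sketch:

# Split a year into two half-year [start, end) windows when a full-year chunk risks OOM.
from datetime import date

def half_year_ranges(year: int) -> list[tuple[date, date]]:
    return [
        (date(year, 1, 1), date(year, 7, 1)),
        (date(year, 7, 1), date(year + 1, 1, 1)),
    ]

for start, end in half_year_ranges(2020):
    # Each range would be backfilled independently, e.g. with a date-bounded
    # BigQuery query like the per-year sketch earlier in this document.
    print(f'Backfill blocks with timestamp in [{start}, {end})')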

Local Execution

For local testing with memory constraints:

# Validate memory requirements before execution
.claude/skills/historical-backfill-execution/scripts/validate_chunk_size.py --year 2020

Output:

Estimating memory for 2020 backfill...
Block count: ~2,600,000
Column count: 11 (optimized schema)
Expected memory: ~3.2 GB
Cloud Run safe: yes (under 4GB limit)
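
The estimate above can be approximated by hand. A back-of-the-envelope sketch (the bytes-per-row figure is an assumed average for the 11-column schema, not a measured constant):

# Rough memory estimate for one year's chunk: rows x assumed bytes per row.
BLOCKS_PER_YEAR = 2_600_000  # ~2.6M Ethereum blocks per year (from the pattern above)
BYTES_PER_ROW = 1_250        # assumption: average in-memory row size for 11 columns

estimated_gb = BLOCKS_PER_YEAR * BYTES_PER_ROW / 1e9
print(f'Estimated memory: ~{estimated_gb:.1f} GB '
      f'({"under" if estimated_gb < 4 else "over"} the 4 GB Cloud Run limit)')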

Troubleshooting

See Troubleshooting Reference for complete guide.

Common Issues:

| Issue | Cause | Solution |
|-------|-------|----------|
| OOM error (code 137) | Chunk too large | Reduce year range (6 months instead of 1 year) |
| BigQuery quota exceeded | >1TB in 30 days | Wait for quota reset or reduce query frequency |
| "Permission denied" | Missing IAM roles | Grant roles/bigquery.user to service account |
| "Table not found" | Wrong dataset | Verify bigquery-public-data.crypto_ethereum.blocks |
| Slow execution (>5min/year) | Network issues | Check Cloud Run region (use same as BigQuery: us) |

Backfill Patterns

See Backfill Patterns Reference for alternatives and rationale.

Pattern Comparison:

| Pattern | Chunk Size | Memory | Time (10yr) | Retry Granularity | Recommended |
|---------|------------|--------|-------------|-------------------|-------------|
| 1-Year Chunks | ~2.6M blocks | <4GB | 20 min | Year-level | Yes (canonical) |
| Month Chunks | ~220K blocks | <1GB | 35 min | Month-level | Over-chunked (slower) |
| Full Load | 26M blocks | >8GB | N/A | All-or-nothing | No (OOM errors) |

Operational History

Complete Historical Backfill (2025-11-10):

  • Goal: Load 23.8M Ethereum blocks (2015-2025)
  • Method: 1-year chunked backfill via chunked_backfill.sh
  • Execution Time: ~20 minutes total
  • Memory Usage: <4GB per chunk (no OOM errors)
  • Result: 100% completeness, zero gaps, zero duplicates
  • Verification: 23.8M blocks in ClickHouse Cloud, latest block within 60 seconds
  • Cost: $0 (within BigQuery free tier)

SLO Achievement: Complete historical data collection (10 years, 23.8M blocks) in <30 minutes with zero manual intervention.

Related Documentation

Scripts

References