| name | historical-backfill-execution |
| description | Execute chunked historical blockchain data backfills using canonical 1-year pattern. Use when loading multi-year historical data, filling gaps in ClickHouse, or preventing OOM failures on Cloud Run. Keywords: chunked_backfill.sh, BigQuery historical, gap filling, memory-safe backfill. |
Historical Backfill Execution
Version: 2.0.0
Last Updated: 2025-11-29
Purpose: Execute chunked historical blockchain data backfills using the canonical 1-year pattern
When to Use
Use this skill when:
- Loading multi-year historical data (e.g., 2015-2025 Ethereum blocks, 23.8M blocks total)
- Gaps detected in ClickHouse requiring backfill
- Preventing OOM (Out of Memory) failures on Cloud Run (4GB memory limit)
- Need to execute complete historical data collection
- Keywords: chunked_backfill.sh, BigQuery historical, gap filling, memory-safe backfill
Canonical Pattern (Established 2025-11-10)
Empirically Validated Approach:
- Chunk size: 1 year (~2.6M blocks for Ethereum)
- Memory usage: <4GB per chunk (Cloud Run safe, no OOM errors)
- Execution time: ~1m40s-2m per chunk (total ~20 minutes for 10 years)
- Idempotency: ReplacingMergeTree allows safe re-runs (automatic deduplication; see the sketch below)
- Cost: $0/month (within BigQuery 1TB/month free tier, ~10MB per query)
Why 1-Year Chunks?
See Backfill Patterns Reference for complete rationale (memory constraints, retry granularity, progress tracking).
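The idempotency bullet above rests on ClickHouse's ReplacingMergeTree behavior: rows sharing a sorting key collapse to a single copy at merge time, or at read time with FINAL. A minimal self-contained check against a throwaway table (hypothetical name `default.idempotency_demo`), reusing the same connection setup as the verification step below:

```python
import os
import clickhouse_connect

# Same connection parameters as the verification step in this skill.
client = clickhouse_connect.get_client(
    host=os.environ['CLICKHOUSE_HOST'],
    port=8443,
    username='default',
    password=os.environ['CLICKHOUSE_PASSWORD'],
    secure=True
)

# Throwaway table with the same engine family as ethereum_mainnet.blocks.
client.command("""
    CREATE TABLE IF NOT EXISTS default.idempotency_demo
    (number UInt64, hash String)
    ENGINE = ReplacingMergeTree
    ORDER BY number
""")

# Two separate inserts of the same row, as a re-run of a chunk would produce.
row = [[1, '0xabc']]
client.insert('default.idempotency_demo', row, column_names=['number', 'hash'])
client.insert('default.idempotency_demo', row, column_names=['number', 'hash'])

# FINAL applies the replacing logic at read time: the count is 1, not 2.
print(client.query('SELECT count() FROM default.idempotency_demo FINAL').result_rows[0][0])

client.command('DROP TABLE default.idempotency_demo')
```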
Prerequisites
- GCP project access: `eonlabs-ethereum-bq`
- BigQuery dataset: `bigquery-public-data.crypto_ethereum.blocks`
- ClickHouse Cloud database: `ethereum_mainnet.blocks`
- ClickHouse credentials configured (via Doppler or environment variable)
- Python dependencies: `google-cloud-bigquery`, `pyarrow`, `clickhouse-connect`
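Before running anything, a quick preflight check of these prerequisites can save a failed chunk later. A minimal sketch, assuming credentials are injected the same way as in the verification step (Doppler, or exported `CLICKHOUSE_HOST` / `CLICKHOUSE_PASSWORD` variables):

```python
import importlib.util
import os
import sys

def module_available(name: str) -> bool:
    """Return True if the module can be imported without actually importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        return False

# Import names for the Python dependencies listed above.
required_modules = ['google.cloud.bigquery', 'pyarrow', 'clickhouse_connect']
# Environment variables consumed by the verification snippet in this skill.
required_env = ['CLICKHOUSE_HOST', 'CLICKHOUSE_PASSWORD']

missing = [m for m in required_modules if not module_available(m)]
unset = [v for v in required_env if not os.environ.get(v)]

if missing or unset:
    print(f'Missing modules: {missing}')
    print(f'Unset variables: {unset}')
    sys.exit(1)
print('Prerequisites look good.')
```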
Workflow
1. Execute Chunked Backfill
Run the canonical 1-year chunking script:
cd deployment/backfill
./chunked_backfill.sh 2015 2025
What This Does:
- Loads blocks year-by-year from BigQuery
- Uses PyArrow for efficient data transfer
- Inserts into ClickHouse with ReplacingMergeTree (automatic deduplication)
- Shows progress for each year: "Loading blocks 1 - 2,600,000"
Alternative (use provided validation wrapper):
.claude/skills/historical-backfill-execution/scripts/chunked_executor.sh 2015 2025
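Conceptually, each iteration issues one BigQuery query scoped to a single year, materializes the result as a PyArrow table, and inserts it into ClickHouse, relying on ReplacingMergeTree for safe re-runs. A minimal sketch of one iteration; the three selected columns are an illustrative subset (the production script selects the full 11-column schema), and `insert_arrow` is one reasonable transfer path rather than necessarily the one the script uses:

```python
import os
from google.cloud import bigquery
import clickhouse_connect

def load_year(year: int) -> int:
    """Load one year of Ethereum blocks from BigQuery into ClickHouse; return row count."""
    bq = bigquery.Client(project='eonlabs-ethereum-bq')
    # Illustrative column subset; the production schema uses 11 columns.
    sql = f"""
        SELECT number, timestamp, gas_used
        FROM `bigquery-public-data.crypto_ethereum.blocks`
        WHERE timestamp >= TIMESTAMP '{year}-01-01'
          AND timestamp <  TIMESTAMP '{year + 1}-01-01'
    """
    arrow_table = bq.query(sql).result().to_arrow()  # one year stays within the memory budget

    ch = clickhouse_connect.get_client(
        host=os.environ['CLICKHOUSE_HOST'],
        port=8443,
        username='default',
        password=os.environ['CLICKHOUSE_PASSWORD'],
        secure=True
    )
    # ReplacingMergeTree deduplicates on re-run, so a retried year is harmless.
    ch.insert_arrow('ethereum_mainnet.blocks', arrow_table)
    return arrow_table.num_rows

if __name__ == '__main__':
    for year in range(2015, 2026):
        print(f'Year {year}: loaded {load_year(year):,} blocks')
```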
2. Monitor Progress
Watch the output for year-by-year progress:
Loading blocks for year 2015...
Year 2015: Loading blocks 1 - 2,600,000
Completed 2015 in 1m42s
Loading blocks for year 2016...
Year 2016: Loading blocks 2,600,001 - 5,200,000
Completed 2016 in 1m38s
...
All years completed successfully!
Total blocks loaded: 23,800,000
Total time: 18m45s
3. Verify Completeness
After backfill completes, verify all blocks loaded:
cd
doppler run --project aws-credentials --config prd -- python3 -c "
import clickhouse_connect
import os
client = clickhouse_connect.get_client(
host=os.environ['CLICKHOUSE_HOST'],
port=8443,
username='default',
password=os.environ['CLICKHOUSE_PASSWORD'],
secure=True
)
result = client.query('SELECT COUNT(*) as total, MIN(number) as min_block, MAX(number) as max_block FROM ethereum_mainnet.blocks FINAL')
print(f'Total blocks: {result.result_rows[0][0]:,}')
print(f'Block range: {result.result_rows[0][1]:,} to {result.result_rows[0][2]:,}')
"
Expected Output (healthy):
Total blocks: 23,800,000+
Block range: 1 to 23,800,000+
4. Detect Gaps (If Needed)
Run gap detection to ensure zero missing blocks. The gap monitor Cloud Function runs every 3 hours automatically. For manual checks:
# Trigger manual gap check via Cloud Scheduler
gcloud scheduler jobs run motherduck-monitor-trigger --location=us-east1
# View logs
gcloud functions logs read motherduck-gap-detector --region=us-east1 --gen2 --limit=50
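To inspect gaps directly in ClickHouse instead of waiting on the Cloud Function, a window-function query over the block numbers lists any missing ranges. A sketch, assuming the block range starts at 1 as in the healthy output above; scanning 23.8M rows this way is fine occasionally but is not a replacement for the automated monitor:

```python
import os
import clickhouse_connect

client = clickhouse_connect.get_client(
    host=os.environ['CLICKHOUSE_HOST'],
    port=8443,
    username='default',
    password=os.environ['CLICKHOUSE_PASSWORD'],
    secure=True
)

# A row whose predecessor is more than 1 behind marks the end of a missing range.
gap_sql = """
    SELECT prev + 1 AS gap_start,
           number - 1 AS gap_end
    FROM (
        SELECT number,
               lagInFrame(number) OVER (ORDER BY number) AS prev
        FROM ethereum_mainnet.blocks FINAL
    )
    WHERE number - prev > 1
"""
gaps = client.query(gap_sql).result_rows
if gaps:
    for start, end in gaps:
        print(f'Missing blocks {start:,} - {end:,}')
else:
    print('No gaps detected.')
```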
5. Handle Specific Year Range (Partial Backfill)
Load specific years only:
# Load only 2023-2024
cd deployment/backfill
./chunked_backfill.sh 2023 2024
Use Cases:
- Filling detected gaps in specific years
- Testing backfill process on small range
- Recovering from failed backfill (resume from specific year)
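To decide which years actually need a re-run, a per-year completeness check works well. A sketch, assuming the table keeps the block `timestamp` column so rows can be grouped by year; gaps that straddle a year boundary are better caught by the gap query in step 4:

```python
import os
import clickhouse_connect

client = clickhouse_connect.get_client(
    host=os.environ['CLICKHOUSE_HOST'],
    port=8443,
    username='default',
    password=os.environ['CLICKHOUSE_PASSWORD'],
    secure=True
)

# Within one year, block numbers are contiguous, so count() should equal
# max(number) - min(number) + 1; any shortfall flags that year for a re-run.
per_year_sql = """
    SELECT toYear(timestamp) AS year,
           count() AS loaded,
           max(number) - min(number) + 1 AS expected
    FROM ethereum_mainnet.blocks FINAL
    GROUP BY year
    ORDER BY year
"""
for year, loaded, expected in client.query(per_year_sql).result_rows:
    status = 'OK' if loaded == expected else 'incomplete - re-run this year'
    print(f'{year}: {loaded:,} / {expected:,} blocks -> {status}')
```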
Memory Management
Cloud Run Limits
Constraint: 4GB memory limit per Cloud Run Job execution
Solution: 1-year chunks keep memory usage <4GB per execution
What Happens on OOM:
- Cloud Run Job exits with code 137 (SIGKILL)
- Error: "Memory limit exceeded"
- Solution: Reduce chunk size or increase Cloud Run memory allocation
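If a chunk does hit the limit, the remediation is to shrink the timestamp window rather than the overall approach: run a year as two or more month-aligned sub-queries. A sketch of generating those windows for the BigQuery WHERE clause; the split factor is an assumption to tune against your memory headroom:

```python
from datetime import date

def sub_chunks(year: int, parts: int = 2):
    """Split one calendar year into `parts` month-aligned [start, end) windows."""
    assert 12 % parts == 0, 'use a divisor of 12 so windows stay month-aligned'
    step = 12 // parts  # parts=2 -> 6-month windows
    bounds = [date(year, 1 + i * step, 1) for i in range(parts)] + [date(year + 1, 1, 1)]
    return list(zip(bounds[:-1], bounds[1:]))

# Each window becomes one BigQuery query and one ClickHouse insert.
for start, end in sub_chunks(2021, parts=2):
    print(f"WHERE timestamp >= TIMESTAMP '{start}' AND timestamp < TIMESTAMP '{end}'")
```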
Local Execution
For local testing with memory constraints:
# Validate memory requirements before execution
.claude/skills/historical-backfill-execution/scripts/validate_chunk_size.py --year 2020
Output:
Estimating memory for 2020 backfill...
Block count: ~2,600,000
Column count: 11 (optimized schema)
Expected memory: ~3.2 GB
Cloud Run safe: Yes (under 4GB limit)
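The validator's estimate can be reproduced roughly by hand as rows multiplied by an assumed bytes-per-row figure and an overhead factor for Arrow buffers and Python objects. The constants below are illustrative assumptions, not the validator's actual formula:

```python
def estimate_chunk_memory_gb(block_count: int,
                             bytes_per_row: int = 900,
                             overhead: float = 1.4) -> float:
    """Rough peak memory for holding one chunk as a PyArrow table.

    bytes_per_row and overhead are assumed values, not measurements.
    """
    return block_count * bytes_per_row * overhead / 1024 ** 3

# ~2.6M blocks per year lands near 3 GB under these assumptions, below the 4 GB limit.
print(f'{estimate_chunk_memory_gb(2_600_000):.1f} GB')
```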
Troubleshooting
See Troubleshooting Reference for complete guide.
Common Issues:
| Issue | Cause | Solution |
|---|---|---|
| OOM error (code 137) | Chunk too large | Reduce year range (6 months instead of 1 year) |
| BigQuery quota exceeded | >1TB in 30 days | Wait for quota reset or reduce query frequency |
| "Permission denied" | Missing IAM roles | Grant roles/bigquery.user to service account |
| "Table not found" | Wrong dataset | Verify bigquery-public-data.crypto_ethereum.blocks |
| Slow execution (>5min/year) | Network issues | Check Cloud Run region (use same as BigQuery: us) |
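Because re-runs are idempotent, the simplest retry strategy is to drive the script one year at a time and retry any year that exits non-zero (an OOM kill surfaces as exit code 137). A sketch, assuming the script accepts the same start and end year for a single-year run, as in the partial-backfill example above:

```python
import subprocess
import time

SCRIPT = 'deployment/backfill/chunked_backfill.sh'

def backfill_year(year: int, retries: int = 3) -> None:
    """Run one year through chunked_backfill.sh, retrying on failure (re-runs are safe)."""
    for attempt in range(1, retries + 1):
        result = subprocess.run([SCRIPT, str(year), str(year)])
        if result.returncode == 0:
            return
        print(f'Year {year} failed (exit {result.returncode}), attempt {attempt}/{retries}')
        time.sleep(30 * attempt)  # simple linear backoff
    raise RuntimeError(f'Year {year} did not complete after {retries} attempts')

for year in range(2015, 2026):
    backfill_year(year)
```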
Backfill Patterns
See Backfill Patterns Reference for alternatives and rationale.
Pattern Comparison:
| Pattern | Chunk Size | Memory | Time (10yr) | Retry Granularity | Recommended |
|---|---|---|---|---|---|
| 1-Year Chunks | ~2.6M blocks | <4GB | 20 min | Year-level | Yes (canonical) |
| Month Chunks | ~220K blocks | <1GB | 35 min | Month-level | Over-chunked (slower) |
| Full Load | 26M blocks | >8GB | N/A | All-or-nothing | No (OOM errors) |
Operational History
Complete Historical Backfill (2025-11-10):
- Goal: Load 23.8M Ethereum blocks (2015-2025)
- Method: 1-year chunked backfill via `chunked_backfill.sh`
- Execution Time: ~20 minutes total
- Memory Usage: <4GB per chunk (no OOM errors)
- Result: 100% completeness, zero gaps, zero duplicates
- Verification: 23.8M blocks in ClickHouse Cloud, latest block within 60 seconds
- Cost: $0 (within BigQuery free tier)
SLO Achievement: Complete historical data collection (10 years, 23.8M blocks) in <30 minutes with zero manual intervention.
Related Documentation
- BigQuery Ethereum Data Acquisition Skill - Column selection rationale
- ClickHouse Migration ADR - Production database migration
- Gap Monitor README - Automated gap detection
Scripts
- `validate_chunk_size.py` - Estimate memory requirements before execution
- `chunked_executor.sh` - Wrapper for `deployment/backfill/chunked_backfill.sh` with validation
References
- `backfill-patterns.md` - 1-year chunking rationale, comparison with alternatives
- `troubleshooting.md` - OOM errors, retry strategies, Cloud Run logs analysis