| name | historical-backfill-execution |
| description | Execute chunked historical blockchain data backfills using canonical 1-year pattern. Use when loading multi-year historical data, filling gaps in ClickHouse, or preventing OOM failures on Cloud Run. Keywords: chunked_backfill.sh, BigQuery historical, gap filling, memory-safe backfill. |
Historical Backfill Execution
Version: 2.0.0
Last Updated: 2025-11-29
Purpose: Execute chunked historical blockchain data backfills using the canonical 1-year pattern
When to Use
Use this skill when:
- Loading multi-year historical data (e.g., 2015-2025 Ethereum blocks, 23.8M blocks total)
- Gaps detected in ClickHouse requiring backfill
- Preventing OOM (Out of Memory) failures on Cloud Run (4GB memory limit)
- Need to execute complete historical data collection
- Keywords: chunked_backfill.sh, BigQuery historical, gap filling, memory-safe backfill
Canonical Pattern (Established 2025-11-10)
Empirically Validated Approach:
- Chunk size: 1 year (~2.6M blocks for Ethereum)
- Memory usage: <4GB per chunk (Cloud Run safe, no OOM errors)
- Execution time: ~1m40s-2m per chunk (total ~20 minutes for 10 years)
- Idempotency: ReplacingMergeTree allows safe re-runs (automatic deduplication; see the sketch below)
- Cost: $0/month (within BigQuery 1TB/month free tier, ~10MB per query)
Why 1-Year Chunks?
See Backfill Patterns Reference for complete rationale (memory constraints, retry granularity, progress tracking).
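The idempotency bullet above rests on ClickHouse's ReplacingMergeTree behavior: rows sharing a sorting key collapse to a single copy at merge time, or at read time with FINAL. A minimal self-contained check against a throwaway table (hypothetical name `default.idempotency_demo`), reusing the same connection setup as the verification step below:

```python
import os
import clickhouse_connect

# Same connection parameters as the verification step in this skill.
client = clickhouse_connect.get_client(
    host=os.environ['CLICKHOUSE_HOST'],
    port=8443,
    username='default',
    password=os.environ['CLICKHOUSE_PASSWORD'],
    secure=True
)

# Throwaway table with the same engine family as ethereum_mainnet.blocks.
client.command("""
    CREATE TABLE IF NOT EXISTS default.idempotency_demo
    (number UInt64, hash String)
    ENGINE = ReplacingMergeTree
    ORDER BY number
""")

# Two separate inserts of the same row, as a re-run of a chunk would produce.
row = [[1, '0xabc']]
client.insert('default.idempotency_demo', row, column_names=['number', 'hash'])
client.insert('default.idempotency_demo', row, column_names=['number', 'hash'])

# FINAL applies the replacing logic at read time: the count is 1, not 2.
print(client.query('SELECT count() FROM default.idempotency_demo FINAL').result_rows[0][0])

client.command('DROP TABLE default.idempotency_demo')
```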
Prerequisites
- GCP project access: `eonlabs-ethereum-bq`
- BigQuery dataset: `bigquery-public-data.crypto_ethereum.blocks`
- ClickHouse Cloud database: `ethereum_mainnet.blocks`
- ClickHouse credentials configured (via Doppler or environment variable)
- Python dependencies: `google-cloud-bigquery`, `pyarrow`, `clickhouse-connect`
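Before running anything, a quick preflight check of these prerequisites can save a failed chunk later. A minimal sketch, assuming credentials are injected the same way as in the verification step (Doppler, or exported `CLICKHOUSE_HOST` / `CLICKHOUSE_PASSWORD` variables):

```python
import importlib.util
import os
import sys

def module_available(name: str) -> bool:
    """Return True if the module can be imported without actually importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        return False

# Import names for the Python dependencies listed above.
required_modules = ['google.cloud.bigquery', 'pyarrow', 'clickhouse_connect']
# Environment variables consumed by the verification snippet in this skill.
required_env = ['CLICKHOUSE_HOST', 'CLICKHOUSE_PASSWORD']

missing = [m for m in required_modules if not module_available(m)]
unset = [v for v in required_env if not os.environ.get(v)]

if missing or unset:
    print(f'Missing modules: {missing}')
    print(f'Unset variables: {unset}')
    sys.exit(1)
print('Prerequisites look good.')
```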
Workflow
1. Execute Chunked Backfill
Run the canonical 1-year chunking script:
cd deployment/backfill
./chunked_backfill.sh 2015 2025
What This Does:
- Loads blocks year-by-year from BigQuery
- Uses PyArrow for efficient data transfer
- Inserts into ClickHouse with ReplacingMergeTree (automatic deduplication)
- Shows progress for each year: "Loading blocks 1 - 2,600,000"
Alternative (use provided validation wrapper):
.claude/skills/historical-backfill-execution/scripts/chunked_executor.sh 2015 2025
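Conceptually, each iteration issues one BigQuery query scoped to a single year, materializes the result as a PyArrow table, and inserts it into ClickHouse, relying on ReplacingMergeTree for safe re-runs. A minimal sketch of one iteration; the three selected columns are an illustrative subset (the production script selects the full 11-column schema), and `insert_arrow` is one reasonable transfer path rather than necessarily the one the script uses:

```python
import os
from google.cloud import bigquery
import clickhouse_connect

def load_year(year: int) -> int:
    """Load one year of Ethereum blocks from BigQuery into ClickHouse; return row count."""
    bq = bigquery.Client(project='eonlabs-ethereum-bq')
    # Illustrative column subset; the production schema uses 11 columns.
    sql = f"""
        SELECT number, timestamp, gas_used
        FROM `bigquery-public-data.crypto_ethereum.blocks`
        WHERE timestamp >= TIMESTAMP '{year}-01-01'
          AND timestamp <  TIMESTAMP '{year + 1}-01-01'
    """
    arrow_table = bq.query(sql).result().to_arrow()  # one year stays within the memory budget

    ch = clickhouse_connect.get_client(
        host=os.environ['CLICKHOUSE_HOST'],
        port=8443,
        username='default',
        password=os.environ['CLICKHOUSE_PASSWORD'],
        secure=True
    )
    # ReplacingMergeTree deduplicates on re-run, so a retried year is harmless.
    ch.insert_arrow('ethereum_mainnet.blocks', arrow_table)
    return arrow_table.num_rows

if __name__ == '__main__':
    for year in range(2015, 2026):
        print(f'Year {year}: loaded {load_year(year):,} blocks')
```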
2. Monitor Progress
Watch the output for year-by-year progress:
Loading blocks for year 2015...
Year 2015: Loading blocks 1 - 2,600,000
Completed 2015 in 1m42s
Loading blocks for year 2016...
Year 2016: Loading blocks 2,600,001 - 5,200,000
Completed 2016 in 1m38s
...
All years completed successfully!
Total blocks loaded: 23,800,000
Total time: 18m45s
3. Verify Completeness
After backfill completes, verify all blocks loaded:
cd
doppler run --project aws-credentials --config prd -- python3 -c "
import clickhouse_connect
import os
client = clickhouse_connect.get_client(
host=os.environ['CLICKHOUSE_HOST'],
port=8443,
username='default',
password=os.environ['CLICKHOUSE_PASSWORD'],
secure=True
)
result = client.query('SELECT COUNT(*) as total, MIN(number) as min_block, MAX(number) as max_block FROM ethereum_mainnet.blocks FINAL')
print(f'Total blocks: {result.result_rows[0][0]:,}')
print(f'Block range: {result.result_rows[0][1]:,} to {result.result_rows[0][2]:,}')
"
Expected Output (healthy):
Total blocks: 23,800,000+
Block range: 1 to 23,800,000+
4. Detect Gaps (If Needed)
Run gap detection to ensure zero missing blocks. The gap monitor Cloud Function runs every 3 hours automatically. For manual checks:
# Trigger manual gap check via Cloud Scheduler
gcloud scheduler jobs run motherduck-monitor-trigger --location=us-east1
# View logs
gcloud functions logs read motherduck-gap-detector --region=us-east1 --gen2 --limit=50
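To inspect gaps directly in ClickHouse instead of waiting on the Cloud Function, a window-function query over the block numbers lists any missing ranges. A sketch, assuming the block range starts at 1 as in the healthy output above; scanning 23.8M rows this way is fine occasionally but is not a replacement for the automated monitor:

```python
import os
import clickhouse_connect

client = clickhouse_connect.get_client(
    host=os.environ['CLICKHOUSE_HOST'],
    port=8443,
    username='default',
    password=os.environ['CLICKHOUSE_PASSWORD'],
    secure=True
)

# A row whose predecessor is more than 1 behind marks the end of a missing range.
gap_sql = """
    SELECT prev + 1 AS gap_start,
           number - 1 AS gap_end
    FROM (
        SELECT number,
               lagInFrame(number) OVER (ORDER BY number) AS prev
        FROM ethereum_mainnet.blocks FINAL
    )
    WHERE number - prev > 1
"""
gaps = client.query(gap_sql).result_rows
if gaps:
    for start, end in gaps:
        print(f'Missing blocks {start:,} - {end:,}')
else:
    print('No gaps detected.')
```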
5. Handle Specific Year Range (Partial Backfill)
Load specific years only:
# Load only 2023-2024
cd deployment/backfill
./chunked_backfill.sh 2023 2024
Use Cases:
- Filling detected gaps in specific years
- Testing backfill process on small range
- Recovering from failed backfill (resume from specific year)
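To decide which years actually need a re-run, a per-year completeness check works well. A sketch, assuming the table keeps the block `timestamp` column so rows can be grouped by year; gaps that straddle a year boundary are better caught by the gap query in step 4:

```python
import os
import clickhouse_connect

client = clickhouse_connect.get_client(
    host=os.environ['CLICKHOUSE_HOST'],
    port=8443,
    username='default',
    password=os.environ['CLICKHOUSE_PASSWORD'],
    secure=True
)

# Within one year, block numbers are contiguous, so count() should equal
# max(number) - min(number) + 1; any shortfall flags that year for a re-run.
per_year_sql = """
    SELECT toYear(timestamp) AS year,
           count() AS loaded,
           max(number) - min(number) + 1 AS expected
    FROM ethereum_mainnet.blocks FINAL
    GROUP BY year
    ORDER BY year
"""
for year, loaded, expected in client.query(per_year_sql).result_rows:
    status = 'OK' if loaded == expected else 'incomplete - re-run this year'
    print(f'{year}: {loaded:,} / {expected:,} blocks -> {status}')
```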
Memory Management
Cloud Run Limits
Constraint: 4GB memory limit per Cloud Run Job execution
Solution: 1-year chunks keep memory usage <4GB per execution
What Happens on OOM:
- Cloud Run Job exits with code 137 (SIGKILL)
- Error: "Memory limit exceeded"
- Solution: Reduce chunk size or increase Cloud Run memory allocation
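If a chunk does hit the limit, the remediation is to shrink the timestamp window rather than the overall approach: run a year as two or more month-aligned sub-queries. A sketch of generating those windows for the BigQuery WHERE clause; the split factor is an assumption to tune against your memory headroom:

```python
from datetime import date

def sub_chunks(year: int, parts: int = 2):
    """Split one calendar year into `parts` month-aligned [start, end) windows."""
    assert 12 % parts == 0, 'use a divisor of 12 so windows stay month-aligned'
    step = 12 // parts  # parts=2 -> 6-month windows
    bounds = [date(year, 1 + i * step, 1) for i in range(parts)] + [date(year + 1, 1, 1)]
    return list(zip(bounds[:-1], bounds[1:]))

# Each window becomes one BigQuery query and one ClickHouse insert.
for start, end in sub_chunks(2021, parts=2):
    print(f"WHERE timestamp >= TIMESTAMP '{start}' AND timestamp < TIMESTAMP '{end}'")
```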
Local Execution
For local testing with memory constraints:
# Validate memory requirements before execution
.claude/skills/historical-backfill-execution/scripts/validate_chunk_size.py --year 2020
Output:
Estimating memory for 2020 backfill...
Block count: ~2,600,000
Column count: 11 (optimized schema)
Expected memory: ~3.2 GB
Cloud Run safe: Yes (under 4GB limit)
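The validator's estimate can be reproduced roughly by hand as rows multiplied by an assumed bytes-per-row figure and an overhead factor for Arrow buffers and Python objects. The constants below are illustrative assumptions, not the validator's actual formula:

```python
def estimate_chunk_memory_gb(block_count: int,
                             bytes_per_row: int = 900,
                             overhead: float = 1.4) -> float:
    """Rough peak memory for holding one chunk as a PyArrow table.

    bytes_per_row and overhead are assumed values, not measurements.
    """
    return block_count * bytes_per_row * overhead / 1024 ** 3

# ~2.6M blocks per year lands near 3 GB under these assumptions, below the 4 GB limit.
print(f'{estimate_chunk_memory_gb(2_600_000):.1f} GB')
```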
Troubleshooting
See Troubleshooting Reference for complete guide.
Common Issues:
| Issue | Cause | Solution |
|---|---|---|
| OOM error (code 137) | Chunk too large | Reduce year range (6 months instead of 1 year) |
| BigQuery quota exceeded | >1TB in 30 days | Wait for quota reset or reduce query frequency |
| "Permission denied" | Missing IAM roles | Grant roles/bigquery.user to service account |
| "Table not found" | Wrong dataset | Verify bigquery-public-data.crypto_ethereum.blocks |
| Slow execution (>5min/year) | Network issues | Check Cloud Run region (use same as BigQuery: us) |
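Because re-runs are idempotent, the simplest retry strategy is to drive the script one year at a time and retry any year that exits non-zero (an OOM kill surfaces as exit code 137). A sketch, assuming the script accepts the same start and end year for a single-year run, as in the partial-backfill example above:

```python
import subprocess
import time

SCRIPT = 'deployment/backfill/chunked_backfill.sh'

def backfill_year(year: int, retries: int = 3) -> None:
    """Run one year through chunked_backfill.sh, retrying on failure (re-runs are safe)."""
    for attempt in range(1, retries + 1):
        result = subprocess.run([SCRIPT, str(year), str(year)])
        if result.returncode == 0:
            return
        print(f'Year {year} failed (exit {result.returncode}), attempt {attempt}/{retries}')
        time.sleep(30 * attempt)  # simple linear backoff
    raise RuntimeError(f'Year {year} did not complete after {retries} attempts')

for year in range(2015, 2026):
    backfill_year(year)
```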
Backfill Patterns
See Backfill Patterns Reference for alternatives and rationale.
Pattern Comparison:
| Pattern | Chunk Size | Memory | Time (10yr) | Retry Granularity | Recommended |
|---|---|---|---|---|---|
| 1-Year Chunks | ~2.6M blocks | <4GB | 20 min | Year-level | Yes (canonical) |
| Month Chunks | ~220K blocks | <1GB | 35 min | Month-level | Over-chunked (slower) |
| Full Load | 26M blocks | >8GB | N/A | All-or-nothing | No (OOM errors) |
Operational History
Complete Historical Backfill (2025-11-10):
- Goal: Load 23.8M Ethereum blocks (2015-2025)
- Method: 1-year chunked backfill via `chunked_backfill.sh`
- Execution Time: ~20 minutes total
- Memory Usage: <4GB per chunk (no OOM errors)
- Result: 100% completeness, zero gaps, zero duplicates
- Verification: 23.8M blocks in ClickHouse Cloud, latest block within 60 seconds
- Cost: $0 (within BigQuery free tier)
SLO Achievement: Complete historical data collection (10 years, 23.8M blocks) in <30 minutes with zero manual intervention.
Related Documentation
- BigQuery Ethereum Data Acquisition Skill - Column selection rationale
- ClickHouse Migration ADR - Production database migration
- Gap Monitor README - Automated gap detection
Scripts
- `validate_chunk_size.py` - Estimate memory requirements before execution
- `chunked_executor.sh` - Wrapper for `deployment/backfill/chunked_backfill.sh` with validation
References
- `backfill-patterns.md` - 1-year chunking rationale, comparison with alternatives
- `troubleshooting.md` - OOM errors, retry strategies, Cloud Run logs analysis