---
name: bigquery-ethereum-data-acquisition
description: Workflow for acquiring historical Ethereum blockchain data using Google BigQuery free tier. Empirically validated for cost estimation, streaming downloads, and DuckDB integration. Use when planning bulk historical data acquisition or comparing data source options for blockchain network metrics.
---
# BigQuery Ethereum Data Acquisition

## Overview
Workflow for acquiring 5 years of Ethereum blockchain data (2020-2025, ~12.44M blocks) using Google BigQuery's public dataset within free tier limits. Includes column selection analysis for ML/time-series forecasting, cost optimization, and empirically validated download workflows.
Status: ✅ Empirically validated (v0.2.0, 2025-11-07)
## When to Use This Skill
Use when:
- Evaluating BigQuery as a data source for Ethereum historical data
- Planning bulk historical blockchain data acquisition
- Optimizing column selection for feature engineering
- Calculating query costs and free tier utilization
- Comparing BigQuery vs RPC polling approaches
- Streaming downloads without BigQuery storage
## Core Workflow
This skill follows a 5-step workflow for acquiring Ethereum data from BigQuery:
| Step | Purpose | Output | Key Metric |
|---|---|---|---|
| 1. Free Tier Limits | Understand query/storage limits | Limits documented | 1 TB query, 10 GB storage |
| 2. Column Selection | Optimize for ML/time-series | 11 columns chosen | 0.97 GB (97% savings) |
| 3. Cost Validation | Dry-run query cost | Cost estimate | 0.1% of free tier |
| 4. Stream Download | Parquet download (no storage) | .parquet file | 62 bytes/row |
| 5. DuckDB Import | Load for analysis | DuckDB database | <100ms query time |
Detailed workflow: See `references/workflow-steps.md` for the complete step-by-step guide with SQL queries, bash commands, and validated results for each step.
Quick start: Run `uv run scripts/test_bigquery_cost.py` to validate cost, then `uv run scripts/download_bigquery_to_parquet.py <start> <end> <output>` to download.
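For orientation, the cost check that `test_bigquery_cost.py` performs can be reproduced with a BigQuery dry run, which reports the bytes a query would scan without executing (or billing) it. A minimal sketch, assuming application-default credentials are configured; the column subset and date range here are illustrative, not the validated 11-column selection:

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses application-default credentials

# Dry run: BigQuery estimates bytes scanned without running the query.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

query = """
SELECT number, timestamp, gas_used           -- illustrative column subset
FROM `bigquery-public-data.crypto_ethereum.blocks`
WHERE timestamp >= TIMESTAMP '2020-01-01'
  AND timestamp <  TIMESTAMP '2025-01-01'    -- illustrative date range
"""

job = client.query(query, job_config=job_config)
scanned_gb = job.total_bytes_processed / 1e9
print(f"Estimated scan: {scanned_gb:.2f} GB "
      f"({scanned_gb / 1_000 * 100:.2f}% of the 1 TB monthly free tier)")
```

Because BigQuery's columnar storage bills by columns scanned, adding or removing columns from the `SELECT` list changes this estimate directly, which is why the 11-column selection cuts the scan by ~97%.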
## Cost Analysis
Optimized selection: 11 of 23 columns = 0.97 GB scanned for the full 12.44M-block query (~0.1% of the 1 TB monthly free tier, ~97% cost savings vs selecting all 23 columns)
Key finding: BigQuery is 624x faster than RPC polling (<1 hour vs 26 days for 12.44M blocks).
Full analysis: See `references/cost-analysis.md` for detailed cost comparison, column selection rationale, and RPC provider comparison.
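The arithmetic behind these figures, using only the numbers quoted in this document (the 62 bytes/row metric refers to the Parquet output size, which is smaller than the bytes scanned per row):

```python
# Back-of-envelope check of the cost and speedup figures quoted above.
free_tier_gb = 1_000        # 1 TB of query processing per month (free tier)
scan_gb = 0.97              # dry-run estimate for the 11-column selection
blocks = 12_440_000         # ~12.44M blocks, 2020-2025

print(f"Free tier used:    {scan_gb / free_tier_gb:.2%}")         # ~0.10%
print(f"Scanned per block: {scan_gb * 1e9 / blocks:.0f} bytes")   # ~78 bytes
print(f"Speedup vs RPC:    {26 * 24}x (26 days vs <1 hour)")      # 624x
```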
## Prerequisites
One-time setup: `gcloud auth`, Python dependencies (`google-cloud-bigquery`, `pandas`, `pyarrow`, `db-dtypes`)
Setup guide: See `references/setup-guide.md` for complete authentication setup, dependency installation, and verification commands.
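A quick way to confirm the setup works, as a minimal sketch assuming application-default credentials are already configured (e.g., via `gcloud auth application-default login`):

```python
# Verify that the Python dependencies import and that credentials resolve.
import pandas      # noqa: F401  (used by BigQuery's to_dataframe())
import pyarrow     # noqa: F401  (used for Parquet output)
import db_dtypes   # noqa: F401  (BigQuery <-> pandas dtype support)
from google.cloud import bigquery

client = bigquery.Client()
print(f"Authenticated; default project: {client.project}")
```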
## Scripts
Validated scripts (v0.2.0):
- `test_bigquery_cost.py` - Dry-run cost estimation (0.97 GB for 12.44M blocks)
- `download_bigquery_to_parquet.py` - Streaming Parquet download (62 bytes/row validated)
Templates and usage: See `scripts/README.md` for complete usage examples, dependencies, and validated results.
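The overall shape of the streaming download is: run the query, pull results straight into Arrow, and write Parquet locally, so nothing is persisted in BigQuery storage. A sketch under assumptions — the column list, compression choice, and CLI handling are illustrative, not the validated script:

```python
"""Illustrative streaming BigQuery -> Parquet download (no BigQuery storage used)."""
import sys

import pyarrow.parquet as pq
from google.cloud import bigquery


def download(start_block: int, end_block: int, output_path: str) -> None:
    client = bigquery.Client()
    query = f"""
    SELECT number, timestamp, gas_used, gas_limit, transaction_count  -- illustrative columns
    FROM `bigquery-public-data.crypto_ethereum.blocks`
    WHERE number BETWEEN {start_block} AND {end_block}
    """
    # Results are pulled back to the client as Arrow data; no destination table is created.
    # to_arrow() materializes the result client-side; very large ranges can be written in chunks.
    table = client.query(query).result().to_arrow()
    pq.write_table(table, output_path, compression="zstd")
    print(f"Wrote {table.num_rows} rows to {output_path}")


if __name__ == "__main__":
    start, end, out = int(sys.argv[1]), int(sys.argv[2]), sys.argv[3]
    download(start, end, out)
```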
## References
### Workflow Documentation
- `references/workflow-steps.md` - Complete 5-step workflow with SQL queries, bash commands, and validated results
- `references/cost-analysis.md` - Detailed cost comparison, column selection rationale, RPC provider comparison
- `references/setup-guide.md` - Authentication setup, dependencies, verification commands
### Research Documents
- `references/bigquery_cost_comparison.md` - Empirical cost analysis (6 vs 11 vs 23 columns)
- `references/ethereum_columns_ml_evaluation.md` - Column-by-column ML value analysis
- `references/bigquery_complete_ethereum_data.md` - Complete dataset catalog (11 tables)
- `references/bigquery_cost_estimate.md` - Free tier limits and methodology
- `references/littleblack-hardware-report.md` - Local vs cloud hardware comparison
### Scripts & Validation
- `scripts/README.md` - Complete script usage guide with validated results
- `VALIDATION_STATUS.md` - Empirical test results, testing methodology, dependencies validated
## Verification After Acquisition
Important: This skill covers data acquisition from BigQuery (downloading historical Ethereum blocks), but does NOT verify the data actually landed in ClickHouse.
After completing the BigQuery download:
- Use the historical-backfill-execution skill
- Run `scripts/clickhouse/verify_blocks.py` to verify database state
- Confirm expected block count (~23.8M blocks for 2015-2025 backfill)
Common workflow:
- Download from BigQuery using this skill (Step 4 above)
- Insert to ClickHouse via `chunked_backfill.sh`
- Verify ClickHouse state using the historical-backfill-execution skill
- Check yearly breakdown to ensure complete coverage
See historical-backfill-execution skill for database verification and troubleshooting missing data.
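As a complementary local check, the downloaded Parquet file itself can be inspected with DuckDB before the ClickHouse insert; this does not replace the ClickHouse verification above. A minimal sketch, assuming the file keeps the BigQuery column names `number` and `timestamp` and using a placeholder filename:

```python
import duckdb

con = duckdb.connect()  # in-memory database; the Parquet file is queried in place

# Overall row count and block/time range of the downloaded file.
con.sql("""
    SELECT
        count(*)       AS block_count,
        min(number)    AS first_block,
        max(number)    AS last_block,
        min(timestamp) AS first_ts,
        max(timestamp) AS last_ts
    FROM read_parquet('blocks_2020_2025.parquet')   -- placeholder path
""").show()

# Yearly breakdown to spot coverage gaps before inserting into ClickHouse.
con.sql("""
    SELECT year(timestamp) AS year, count(*) AS blocks
    FROM read_parquet('blocks_2020_2025.parquet')
    GROUP BY year
    ORDER BY year
""").show()
```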
## Related Skills
- `historical-backfill-execution` - ClickHouse database verification and backfill operations
- `blockchain-rpc-provider-research` - RPC rate limit comparison and provider evaluation
- `blockchain-data-collection-validation` - Empirical validation workflow for data pipelines