---
name: data-ingestion-builder
description: Build new data ingestion providers following the FF Analytics registry pattern. This skill should be used when adding new data sources (APIs, files, databases) to the data pipeline. Guides through creating provider packages, registry mappings, loader functions, storage integration, primary key tests, and sampling tools following established patterns.
---
# Data Ingestion Provider Builder
Create complete data ingestion providers for the Fantasy Football Analytics project following established patterns. This skill automates the process of adding new data sources with proper structure, metadata, testing, and integration.
## When to Use This Skill
Use this skill proactively when:
- Adding a new data source to the pipeline (API, file, database)
- User mentions integrating data from a new provider
- User asks about "adding a provider" or "new data source"
- User references specific APIs or data sources to integrate (e.g., "add ESPN API", "integrate PFF data")
- Expanding data coverage for analytics
## Provider Integration Philosophy
The FF Analytics project follows these principles for data ingestion:
- **Registry Pattern** - Central mapping of datasets to loaders
- **Storage Abstraction** - Uniform Parquet output with metadata sidecars
- **Metadata-First** - Every load produces `_meta.json` with lineage
- **Testable Samples** - Primary key validation on sample data
- **Local & Cloud** - Same code works for local paths and `gs://` URIs
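For example, a minimal sketch of the storage abstraction in use, assuming the `write_parquet_any` helper introduced in Step 4 (which accepts both local paths and `gs://` URIs):

```python
import polars as pl

from ingest.common.storage import write_parquet_any

df = pl.DataFrame({"player_id": ["p1", "p2"], "points": [12.5, 8.0]})

# The same call works for a local path...
write_parquet_any(df, "data/raw/my_provider/players.parquet")

# ...and for a GCS URI, with no code changes.
write_parquet_any(df, "gs://ff-analytics/raw/my_provider/players.parquet")
```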
## Provider Building Workflow
Follow this six-step process to create a complete provider:
### Step 1: Understand the Data Source
Before coding, gather information about the provider:
Ask clarifying questions:
- What datasets does this provider offer?
- What is the API/file format?
- What are the authentication requirements?
- What are the primary keys for each dataset?
- Are there rate limits or ToS considerations?
- What is the update frequency?
Research existing documentation:
- API documentation URLs
- Data schemas and field descriptions
- Authentication methods
- Rate limiting policies
Output: Clear understanding of:
- Dataset names and descriptions
- Primary keys for each dataset
- Authentication approach
- Any special considerations
### Step 2: Design the Registry
Map datasets to loader functions and define metadata.
Use `assets/registry_template.py` as a starting point.
For each dataset, define:
- `name`: Logical dataset name (lowercase, descriptive)
- `loader_function`: Function name in `loader.py`
- `primary_keys`: Tuple of columns that uniquely identify rows
- `description`: Brief description of dataset contents
- `notes`: Special considerations, dependencies, or caveats
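The examples that follow assume a `DatasetSpec` container along these lines; this is a minimal sketch, and the actual definition in `assets/registry_template.py` may differ:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetSpec:
    """Describes one dataset a provider offers."""

    name: str
    loader_function: str
    primary_keys: tuple[str, ...]
    description: str
    notes: str = ""
```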
Example registry design:
```python
REGISTRY = {
    "players": DatasetSpec(
        name="players",
        loader_function="load_players",
        primary_keys=("player_id",),
        description="Player biographical and career data",
        notes="Updates daily. Includes active and retired players.",
    ),
    "stats": DatasetSpec(
        name="stats",
        loader_function="load_stats",
        primary_keys=("player_id", "game_id", "stat_type"),
        description="Game-level player statistics",
        notes="Grain: one row per player per game per stat type",
    ),
}
```
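The registry's payoff is uniform dispatch: callers look up a spec and resolve its loader by name instead of hard-coding imports. A minimal sketch, assuming the `REGISTRY` shape above and the package layout from Step 3:

```python
import importlib
from typing import Any


def run_dataset(provider: str, dataset: str, **kwargs: Any) -> dict[str, Any]:
    """Resolve a dataset's loader from its provider registry and invoke it."""
    registry = importlib.import_module(f"ingest.{provider}.registry").REGISTRY
    loader_module = importlib.import_module(f"ingest.{provider}.loader")
    spec = registry[dataset]
    loader = getattr(loader_module, spec.loader_function)
    return loader(**kwargs)


manifest = run_dataset("my_provider", "players", out_dir="data/raw/my_provider")
```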
Quality checks:
- Primary keys are truly unique for the grain
- Dataset names are descriptive and consistent
- Loader function names follow the `load_{dataset_name}` pattern
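Before locking in a primary key, verify uniqueness against real data. A quick sketch with Polars (the sample path is illustrative):

```python
import polars as pl

df = pl.read_parquet("samples/my_provider/stats/sample.parquet")
pk = ["player_id", "game_id", "stat_type"]

# Any key combination appearing more than once violates the grain.
dupes = df.group_by(pk).len().filter(pl.col("len") > 1)
if dupes.height > 0:
    print(f"{dupes.height} duplicate key combinations found:")
    print(dupes.head())
```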
### Step 3: Create Provider Package Structure
Create the directory structure following the template.
See `assets/package_structure.md` for the complete structure.
Create directories:
```bash
mkdir -p src/ingest/{provider}
mkdir -p tests
mkdir -p samples/{provider}
```
Create files:
- `src/ingest/{provider}/__init__.py` (empty or with exports)
- `src/ingest/{provider}/registry.py` (from Step 2)
- `src/ingest/{provider}/loader.py` (will implement in Step 4)
- `tests/test_{provider}_samples_pk.py` (will implement in Step 5)
Naming:
- Provider name: lowercase, underscore-separated
- Examples: `nflverse`, `espn_api`, `my_provider`
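If the package exports anything, a minimal `__init__.py` sketch (re-exporting is optional; the names assume a hypothetical `my_provider` package):

```python
"""my_provider ingestion package."""

from .loader import load_players, load_stats
from .registry import REGISTRY

__all__ = ["REGISTRY", "load_players", "load_stats"]
```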
### Step 4: Implement Loader Functions
Create loader functions using the storage helper pattern.
Use `assets/loader_template.py` as a starting point.
For each dataset in the registry:
1. Create a loader function following this signature:
```python
def load_{dataset_name}(
    out_dir: str = "data/raw/{provider}",
    **kwargs,
) -> dict[str, Any]:
    ...
```
2. Implement data fetching:
- API calls with proper authentication
- File parsing (CSV, JSON, XML, etc.)
- Database queries
- Handle pagination, retries, error cases
3. Convert to a DataFrame:
- Prefer Polars for performance
- Pandas acceptable for compatibility
- Ensure consistent column types
4. Write with the storage helper:
```python
import json
from datetime import UTC, datetime

from ingest.common.storage import write_parquet_any, write_text_sidecar

# Write Parquet
write_parquet_any(df, parquet_file)

# Write metadata sidecar
metadata = {
    "dataset": dataset_name,
    "asof_datetime": datetime.now(UTC).isoformat(),
    "loader_path": "src.ingest.{provider}.loader.load_{dataset}",
    "source_name": "{PROVIDER}",
    "source_version": version,
    "output_parquet": parquet_file,
    "row_count": len(df),
}
write_text_sidecar(json.dumps(metadata, indent=2), f"{partition_dir}/_meta.json")
```
5. Return the manifest:
```python
return {
    "dataset": dataset_name,
    "partition_dir": partition_dir,
    "parquet_file": parquet_file,
    "row_count": len(df),
    "metadata": metadata,
}
```
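Putting these pieces together, a minimal end-to-end loader sketch. This assumes the storage helpers above behave as described; `fetch_players_from_api` is a hypothetical fetch function standing in for real API calls:

```python
import json
import uuid
from datetime import UTC, datetime
from typing import Any

import polars as pl

from ingest.common.storage import write_parquet_any, write_text_sidecar


def load_players(out_dir: str = "data/raw/my_provider", **kwargs: Any) -> dict[str, Any]:
    """Fetch player data, write Parquet plus a metadata sidecar, return a manifest."""
    records = fetch_players_from_api(**kwargs)  # hypothetical fetch helper
    df = pl.DataFrame(records)

    # Partition by load date; a UUID suffix keeps repeated loads from colliding.
    dt = datetime.now(UTC).date().isoformat()
    partition_dir = f"{out_dir}/players/dt={dt}"
    parquet_file = f"{partition_dir}/players_{uuid.uuid4().hex[:8]}.parquet"
    write_parquet_any(df, parquet_file)

    metadata = {
        "dataset": "players",
        "asof_datetime": datetime.now(UTC).isoformat(),
        "loader_path": "src.ingest.my_provider.loader.load_players",
        "source_name": "MY_PROVIDER",
        "output_parquet": parquet_file,
        "row_count": len(df),
    }
    write_text_sidecar(json.dumps(metadata, indent=2), f"{partition_dir}/_meta.json")

    return {
        "dataset": "players",
        "partition_dir": partition_dir,
        "parquet_file": parquet_file,
        "row_count": len(df),
        "metadata": metadata,
    }
```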
Reference examples:
- `references/example_loader.py` - Complete nflverse loader
- `references/example_storage.py` - Storage helper implementation
Common patterns:
- Use `datetime.now(UTC)` for all timestamps
- Generate UUIDs for file names: `uuid.uuid4().hex[:8]`
- Partition by date: `dt=YYYY-MM-DD`
- Handle both local paths and `gs://` URIs uniformly
### Step 5: Create Primary Key Tests
Validate sample data quality with automated tests.
Use `assets/test_template.py` as a starting point.
Test structure:
```python
@pytest.mark.parametrize("dataset_name,spec", REGISTRY.items())
def test_{provider}_primary_keys(dataset_name, spec):
    # 1. Find sample files
    # 2. Read with Polars
    # 3. Check PK columns exist
    # 4. Check PK uniqueness
    # 5. Report duplicates if found
```
What to test:
- Primary key columns exist in dataset
- Primary key uniqueness (no duplicates)
- Sample data is non-empty
- Metadata sidecars exist and are valid
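A fuller sketch of what that parametrized test might look like, assuming samples land under `samples/{provider}/{dataset}/dt=*/` as in the validation commands in Step 6; `my_provider` is a hypothetical package name:

```python
import glob
import json

import polars as pl
import pytest

from ingest.my_provider.registry import REGISTRY


@pytest.mark.parametrize("dataset_name,spec", list(REGISTRY.items()))
def test_my_provider_primary_keys(dataset_name, spec):
    files = glob.glob(f"samples/my_provider/{dataset_name}/dt=*/*.parquet")
    assert files, f"no sample files for {dataset_name}"

    df = pl.read_parquet(files[0])
    assert df.height > 0, "sample data is empty"

    # Primary key columns must exist.
    missing = [c for c in spec.primary_keys if c not in df.columns]
    assert not missing, f"missing PK columns: {missing}"

    # Primary key combinations must be unique.
    dupes = df.group_by(list(spec.primary_keys)).len().filter(pl.col("len") > 1)
    assert dupes.height == 0, f"duplicate keys:\n{dupes.head()}"

    # Metadata sidecar must exist and parse as JSON.
    meta_files = glob.glob(f"samples/my_provider/{dataset_name}/dt=*/_meta.json")
    assert meta_files, "missing _meta.json sidecar"
    with open(meta_files[0]) as f:
        json.load(f)
```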
Run tests:
```bash
pytest tests/test_{provider}_samples_pk.py -v
```
### Step 6: Integrate with Project Tooling
Connect the provider to existing workflows.
Update `tools/make_samples.py`:
Add provider-specific sampling logic:
```python
# In make_samples.py argument parser
elif args.provider == "{provider}":
    from ingest.{provider}.loader import load_{dataset}

    # Provider-specific argument parsing
    datasets = args.datasets or ["default_dataset"]
    for dataset in datasets:
        result = load_{dataset}(
            out_dir=args.out,
            **provider_kwargs,
        )
        print(f"✓ Sampled {dataset}: {result['row_count']} rows")
```
Update documentation:
- `src/ingest/CLAUDE.md` - Add provider-specific notes
- Root `CLAUDE.md` - If architecturally significant
- `README.md` - If user-facing
Create sample data:
```bash
uv run python tools/make_samples.py {provider} --datasets {dataset1} {dataset2} --out ./samples
```
Validate:
```bash
# Check sample data created
ls -la samples/{provider}/

# Run PK tests
pytest tests/test_{provider}_samples_pk.py -v

# Check metadata
cat samples/{provider}/{dataset}/dt=*/_meta.json | jq .
```
## Resources Provided
### references/
Provider implementation examples from the codebase:
- `example_registry.py` - Complete registry from nflverse with 10+ datasets
- `example_loader.py` - nflverse shim loader with Python/R fallback pattern
- `example_storage.py` - Storage helper with local and GCS support
Load these references when implementing a new provider to see proven patterns.
### assets/
Templates for creating new providers:
- `registry_template.py` - `registry.py` skeleton with placeholders
- `loader_template.py` - Loader function template with storage helpers
- `test_template.py` - Primary key test template with pytest
- `package_structure.md` - Complete directory structure and integration guide
Use these templates directly when generating provider code.
## Best Practices
### Registry Design
- **Accurate primary keys** - Test with real data to verify uniqueness
- **Descriptive names** - Use clear, consistent dataset names
- **Document grain** - Notes should explain row-level granularity
- **Consider joins** - Design PKs to enable joins with other datasets
### Loader Implementation
- **Handle failures gracefully** - Return empty DataFrames with metadata on errors
- **Include traceability** - Capture input parameters in metadata
- **Respect rate limits** - Add delays, implement exponential backoff
- **Validate before writing** - Check schema, row counts, nulls
- **Use storage helpers** - Don't reimplement Parquet writing
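As an example of the first bullet, a sketch of failing gracefully while preserving traceability (assuming Polars and the metadata shape from Step 4; `fetch_players_from_api` is a hypothetical fetch helper):

```python
from datetime import UTC, datetime

import polars as pl

try:
    records = fetch_players_from_api()  # hypothetical fetch helper
    df = pl.DataFrame(records)
    error = None
except Exception as exc:
    # Return an empty frame instead of crashing the pipeline run,
    # and record the failure in the metadata for traceability.
    df = pl.DataFrame()
    error = str(exc)

metadata = {
    "dataset": "players",
    "asof_datetime": datetime.now(UTC).isoformat(),
    "row_count": len(df),
    "error": error,  # illustrative field, not a fixed schema
}
```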
### Testing
- **Test with real samples** - Use actual provider data, not mocks
- **Cover all datasets** - Parametrize tests across the registry
- **Check metadata completeness** - Validate all required fields
- **Document expected failures** - If some rows are expected to fail PK tests, note why
### Integration
- **Update `make_samples.py`** - Enable easy sample generation
- **Document requirements** - Note authentication, dependencies, setup
- **Add to `CLAUDE.md`** - Help future developers understand the provider
- **Consider CI/CD** - Add to GitHub Actions if automated refresh is needed
## Common Patterns
### Authentication
Environment variables:
```python
import os

api_key = os.environ.get("{PROVIDER}_API_KEY")
if not api_key:
    raise ValueError("Set {PROVIDER}_API_KEY environment variable")
```
OAuth flow:
```python
from requests_oauthlib import OAuth2Session

oauth = OAuth2Session(client_id, token=token)
response = oauth.get(endpoint)
```
### Pagination
Offset-based:
```python
# fetch() stands in for the provider's actual API call
all_data = []
offset = 0
limit = 100
while True:
    response = fetch(offset=offset, limit=limit)
    data = response.json()
    all_data.extend(data)
    if len(data) < limit:
        break
    offset += limit
```
Cursor-based:
```python
all_data = []
cursor = None
while True:
    response = fetch(cursor=cursor)
    data = response.json()
    all_data.extend(data["results"])
    cursor = data.get("next_cursor")
    if not cursor:
        break
```
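Either variant can be wrapped in a generator, which keeps memory flat for large result sets by yielding records as pages arrive rather than accumulating them all. A minimal sketch for the offset-based case (same hypothetical `fetch` helper as above):

```python
from collections.abc import Iterator
from typing import Any


def paginate(fetch, limit: int = 100) -> Iterator[dict[str, Any]]:
    """Yield records one at a time across offset-based pages."""
    offset = 0
    while True:
        page = fetch(offset=offset, limit=limit).json()
        yield from page
        if len(page) < limit:
            break
        offset += limit
```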
### Rate Limiting
Simple delay:
```python
import time

for dataset in datasets:
    result = load_dataset()
    time.sleep(1)  # 1 second between requests
```
Exponential backoff:
```python
import time

from requests.exceptions import HTTPError

max_retries = 3
for attempt in range(max_retries):
    try:
        response = fetch()
        response.raise_for_status()
        break
    except HTTPError as e:
        if e.response.status_code == 429:  # Rate limited
            time.sleep(2 ** attempt)  # 1s, 2s, 4s
        else:
            raise
else:
    raise RuntimeError("Still rate limited after max retries")
```
## Output Format
When helping the user create a provider:
**After Step 2 (Registry Design):**

✅ Registry Designed: {provider}

Datasets defined:
- {dataset1}: {description} (PK: {pk_columns})
- {dataset2}: {description} (PK: {pk_columns})

Ready to create package structure (Step 3)?

**After Step 4 (Loader Implementation):**

✅ Loaders Implemented

Created loader functions:
- load_{dataset1}() - Fetches from {source}
- load_{dataset2}() - Fetches from {source}

All loaders use storage helpers and write metadata sidecars.

Ready to create tests (Step 5)?

**After Step 6 (Integration Complete):**

✅ Provider Integration Complete: {provider}

Created:
- Registry: `src/ingest/{provider}/registry.py` ({N} datasets)
- Loaders: `src/ingest/{provider}/loader.py`
- Tests: `tests/test_{provider}_samples_pk.py`
- Samples: `samples/{provider}/` ({N} datasets)

Integration:
- ✓ Added to `tools/make_samples.py`
- ✓ Updated documentation
- ✓ Primary key tests passing ({N}/{N})

To use:

```bash
# Generate samples
uv run python tools/make_samples.py {provider} --datasets all --out ./samples

# Run tests
pytest tests/test_{provider}_samples_pk.py -v
```

```python
# Use in production
from ingest.{provider}.loader import load_{dataset}

result = load_{dataset}(out_dir="gs://ff-analytics/raw/{provider}")
```
## Handling User Scenarios
### Scenario: User wants to add a specific API
User says: "Add integration for the ESPN Fantasy API"
Response:
- Begin Step 1 (Understand the Data Source)
- Ask clarifying questions about ESPN API
- Guide through all 6 steps to complete integration
### Scenario: User has API docs, needs implementation
User says: "I have the API docs for PFF, help me integrate it"
Response:
- Ask user to share key details (datasets, auth, PKs)
- Begin Step 2 (Design Registry)
- Proceed through implementation steps
### Scenario: User wants to fix an existing provider
User says: "The nflverse loader is missing a dataset"
Response:
- Read existing provider registry and loaders
- Add new dataset to registry (Step 2)
- Implement loader for new dataset (Step 4)
- Update tests and samples (Steps 5-6)
## Troubleshooting
### Issue: Primary key tests failing
- Review the data grain - are the PKs actually unique?
- Check for null values in PK columns
- Verify sample data represents the full population
- Consider composite keys if a single column is insufficient
### Issue: Storage helper fails with GCS
- Check the `GOOGLE_APPLICATION_CREDENTIALS` environment variable
- Verify GCS bucket permissions
- Test with a local path first, then GCS
- Review `references/example_storage.py` for patterns
### Issue: Loader returns empty data
- Check authentication credentials
- Verify API endpoint URLs
- Review rate limiting and retries
- Add debug logging to data fetching
### Issue: `make_samples.py` not finding the provider
- Ensure the provider package is in `src/ingest/{provider}/`
- Check `PYTHONPATH` includes `src/`
- Verify imports in `make_samples.py`
- Run from the repo root directory
## Integration with Other Skills
This skill works well with:
- **dbt-model-builder** - After ingestion, create staging models for the provider
- **data-quality-test-generator** - Add comprehensive tests beyond primary keys
- **data-architecture-spec1** - Ensure the provider follows SPEC-1 patterns