---
name: data-ingestion-builder
description: Build new data ingestion providers following the FF Analytics registry pattern. This skill should be used when adding new data sources (APIs, files, databases) to the data pipeline. Guides through creating provider packages, registry mappings, loader functions, storage integration, primary key tests, and sampling tools following established patterns.
---
# Data Ingestion Provider Builder
Create complete data ingestion providers for the Fantasy Football Analytics project following established patterns. This skill automates the process of adding new data sources with proper structure, metadata, testing, and integration.
## When to Use This Skill
Use this skill proactively when:
- Adding a new data source to the pipeline (API, file, database)
- User mentions integrating data from a new provider
- User asks about "adding a provider" or "new data source"
- User references specific APIs or data sources to integrate (e.g., "add ESPN API", "integrate PFF data")
- Expanding data coverage for analytics
## Provider Integration Philosophy
The FF Analytics project follows these principles for data ingestion:
- **Registry Pattern** - Central mapping of datasets to loaders
- **Storage Abstraction** - Uniform Parquet output with metadata sidecars
- **Metadata-First** - Every load produces `_meta.json` with lineage
- **Testable Samples** - Primary key validation on sample data
- **Local & Cloud** - Same code works for local paths and `gs://` URIs
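For example, a minimal sketch of the storage abstraction in use, assuming the `write_parquet_any` helper introduced in Step 4 (which accepts both local paths and `gs://` URIs):

```python
import polars as pl

from ingest.common.storage import write_parquet_any

df = pl.DataFrame({"player_id": ["p1", "p2"], "points": [12.5, 8.0]})

# The same call works for a local path...
write_parquet_any(df, "data/raw/my_provider/players.parquet")

# ...and for a GCS URI, with no code changes.
write_parquet_any(df, "gs://ff-analytics/raw/my_provider/players.parquet")
```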
## Provider Building Workflow
Follow this six-step process to create a complete provider:
### Step 1: Understand the Data Source
Before coding, gather information about the provider:
Ask clarifying questions:
- What datasets does this provider offer?
- What is the API/file format?
- What are the authentication requirements?
- What are the primary keys for each dataset?
- Are there rate limits or ToS considerations?
- What is the update frequency?
Research existing documentation:
- API documentation URLs
- Data schemas and field descriptions
- Authentication methods
- Rate limiting policies
Output: Clear understanding of:
- Dataset names and descriptions
- Primary keys for each dataset
- Authentication approach
- Any special considerations
### Step 2: Design the Registry
Map datasets to loader functions and define metadata.
Use `assets/registry_template.py` as a starting point.
For each dataset, define:
- `name`: Logical dataset name (lowercase, descriptive)
- `loader_function`: Function name in `loader.py`
- `primary_keys`: Tuple of columns that uniquely identify rows
- `description`: Brief description of dataset contents
- `notes`: Special considerations, dependencies, or caveats
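The examples that follow assume a `DatasetSpec` container along these lines; this is a minimal sketch, and the actual definition in `assets/registry_template.py` may differ:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetSpec:
    """Describes one dataset a provider offers."""

    name: str
    loader_function: str
    primary_keys: tuple[str, ...]
    description: str
    notes: str = ""
```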
Example registry design:
```python
REGISTRY = {
    "players": DatasetSpec(
        name="players",
        loader_function="load_players",
        primary_keys=("player_id",),
        description="Player biographical and career data",
        notes="Updates daily. Includes active and retired players.",
    ),
    "stats": DatasetSpec(
        name="stats",
        loader_function="load_stats",
        primary_keys=("player_id", "game_id", "stat_type"),
        description="Game-level player statistics",
        notes="Grain: one row per player per game per stat type",
    ),
}
```
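The registry's payoff is uniform dispatch: callers look up a spec and resolve its loader by name instead of hard-coding imports. A minimal sketch, assuming the `REGISTRY` shape above and the package layout from Step 3:

```python
import importlib
from typing import Any


def run_dataset(provider: str, dataset: str, **kwargs: Any) -> dict[str, Any]:
    """Resolve a dataset's loader from its provider registry and invoke it."""
    registry = importlib.import_module(f"ingest.{provider}.registry").REGISTRY
    loader_module = importlib.import_module(f"ingest.{provider}.loader")
    spec = registry[dataset]
    loader = getattr(loader_module, spec.loader_function)
    return loader(**kwargs)


manifest = run_dataset("my_provider", "players", out_dir="data/raw/my_provider")
```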
Quality checks:
- Primary keys are truly unique for the grain
- Dataset names are descriptive and consistent
- Loader function names follow the `load_{dataset_name}` pattern
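Before locking in a primary key, verify uniqueness against real data. A quick sketch with Polars (the sample path is illustrative):

```python
import polars as pl

df = pl.read_parquet("samples/my_provider/stats/sample.parquet")
pk = ["player_id", "game_id", "stat_type"]

# Any key combination appearing more than once violates the grain.
dupes = df.group_by(pk).len().filter(pl.col("len") > 1)
if dupes.height > 0:
    print(f"{dupes.height} duplicate key combinations found:")
    print(dupes.head())
```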
### Step 3: Create Provider Package Structure
Create the directory structure following the template.
See `assets/package_structure.md` for the complete structure.
Create directories:
```bash
mkdir -p src/ingest/{provider}
mkdir -p tests
mkdir -p samples/{provider}
```
Create files:
- `src/ingest/{provider}/__init__.py` (empty or with exports)
- `src/ingest/{provider}/registry.py` (from Step 2)
- `src/ingest/{provider}/loader.py` (will implement in Step 4)
- `tests/test_{provider}_samples_pk.py` (will implement in Step 5)
Naming:
- Provider name: lowercase, underscore-separated
- Examples: `nflverse`, `espn_api`, `my_provider`
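If the package exports anything, a minimal `__init__.py` sketch (re-exporting is optional; the names assume a hypothetical `my_provider` package):

```python
"""my_provider ingestion package."""

from .loader import load_players, load_stats
from .registry import REGISTRY

__all__ = ["REGISTRY", "load_players", "load_stats"]
```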
### Step 4: Implement Loader Functions
Create loader functions using the storage helper pattern.
Use `assets/loader_template.py` as a starting point.
For each dataset in the registry:
1. Create a loader function following this signature:
```python
def load_{dataset_name}(
    out_dir: str = "data/raw/{provider}",
    **kwargs,
) -> dict[str, Any]:
    ...
```
2. Implement data fetching:
- API calls with proper authentication
- File parsing (CSV, JSON, XML, etc.)
- Database queries
- Handle pagination, retries, error cases
3. Convert to a DataFrame:
- Prefer Polars for performance
- Pandas acceptable for compatibility
- Ensure consistent column types
4. Write with the storage helper:
```python
import json
from datetime import UTC, datetime

from ingest.common.storage import write_parquet_any, write_text_sidecar

# Write Parquet
write_parquet_any(df, parquet_file)

# Write metadata sidecar
metadata = {
    "dataset": dataset_name,
    "asof_datetime": datetime.now(UTC).isoformat(),
    "loader_path": "src.ingest.{provider}.loader.load_{dataset}",
    "source_name": "{PROVIDER}",
    "source_version": version,
    "output_parquet": parquet_file,
    "row_count": len(df),
}
write_text_sidecar(json.dumps(metadata, indent=2), f"{partition_dir}/_meta.json")
```
5. Return the manifest:
```python
return {
    "dataset": dataset_name,
    "partition_dir": partition_dir,
    "parquet_file": parquet_file,
    "row_count": len(df),
    "metadata": metadata,
}
```
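Putting these pieces together, a minimal end-to-end loader sketch. This assumes the storage helpers above behave as described; `fetch_players_from_api` is a hypothetical fetch function standing in for real API calls:

```python
import json
import uuid
from datetime import UTC, datetime
from typing import Any

import polars as pl

from ingest.common.storage import write_parquet_any, write_text_sidecar


def load_players(out_dir: str = "data/raw/my_provider", **kwargs: Any) -> dict[str, Any]:
    """Fetch player data, write Parquet plus a metadata sidecar, return a manifest."""
    records = fetch_players_from_api(**kwargs)  # hypothetical fetch helper
    df = pl.DataFrame(records)

    # Partition by load date; a UUID suffix keeps repeated loads from colliding.
    dt = datetime.now(UTC).date().isoformat()
    partition_dir = f"{out_dir}/players/dt={dt}"
    parquet_file = f"{partition_dir}/players_{uuid.uuid4().hex[:8]}.parquet"
    write_parquet_any(df, parquet_file)

    metadata = {
        "dataset": "players",
        "asof_datetime": datetime.now(UTC).isoformat(),
        "loader_path": "src.ingest.my_provider.loader.load_players",
        "source_name": "MY_PROVIDER",
        "output_parquet": parquet_file,
        "row_count": len(df),
    }
    write_text_sidecar(json.dumps(metadata, indent=2), f"{partition_dir}/_meta.json")

    return {
        "dataset": "players",
        "partition_dir": partition_dir,
        "parquet_file": parquet_file,
        "row_count": len(df),
        "metadata": metadata,
    }
```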
Reference examples:
- `references/example_loader.py` - Complete nflverse loader
- `references/example_storage.py` - Storage helper implementation
Common patterns:
- Use `datetime.now(UTC)` for all timestamps
- Generate UUIDs for file names: `uuid.uuid4().hex[:8]`
- Partition by date: `dt=YYYY-MM-DD`
- Handle both local paths and `gs://` URIs uniformly
### Step 5: Create Primary Key Tests
Validate sample data quality with automated tests.
Use `assets/test_template.py` as a starting point.
Test structure:
```python
@pytest.mark.parametrize("dataset_name,spec", REGISTRY.items())
def test_{provider}_primary_keys(dataset_name, spec):
    # 1. Find sample files
    # 2. Read with Polars
    # 3. Check PK columns exist
    # 4. Check PK uniqueness
    # 5. Report duplicates if found
```
What to test:
- Primary key columns exist in dataset
- Primary key uniqueness (no duplicates)
- Sample data is non-empty
- Metadata sidecars exist and are valid
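A fuller sketch of what that parametrized test might look like, assuming samples land under `samples/{provider}/{dataset}/dt=*/` as in the validation commands in Step 6; `my_provider` is a hypothetical package name:

```python
import glob
import json

import polars as pl
import pytest

from ingest.my_provider.registry import REGISTRY


@pytest.mark.parametrize("dataset_name,spec", list(REGISTRY.items()))
def test_my_provider_primary_keys(dataset_name, spec):
    files = glob.glob(f"samples/my_provider/{dataset_name}/dt=*/*.parquet")
    assert files, f"no sample files for {dataset_name}"

    df = pl.read_parquet(files[0])
    assert df.height > 0, "sample data is empty"

    # Primary key columns must exist.
    missing = [c for c in spec.primary_keys if c not in df.columns]
    assert not missing, f"missing PK columns: {missing}"

    # Primary key combinations must be unique.
    dupes = df.group_by(list(spec.primary_keys)).len().filter(pl.col("len") > 1)
    assert dupes.height == 0, f"duplicate keys:\n{dupes.head()}"

    # Metadata sidecar must exist and parse as JSON.
    meta_files = glob.glob(f"samples/my_provider/{dataset_name}/dt=*/_meta.json")
    assert meta_files, "missing _meta.json sidecar"
    with open(meta_files[0]) as f:
        json.load(f)
```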
Run tests:
```bash
pytest tests/test_{provider}_samples_pk.py -v
```
### Step 6: Integrate with Project Tooling
Connect the provider to existing workflows.
Update `tools/make_samples.py`:
Add provider-specific sampling logic:
```python
# In make_samples.py argument parser
elif args.provider == "{provider}":
    from ingest.{provider}.loader import load_{dataset}

    # Provider-specific argument parsing
    datasets = args.datasets or ["default_dataset"]
    for dataset in datasets:
        result = load_{dataset}(
            out_dir=args.out,
            **provider_kwargs,
        )
        print(f"✓ Sampled {dataset}: {result['row_count']} rows")
```
Update documentation:
- `src/ingest/CLAUDE.md` - Add provider-specific notes
- Root `CLAUDE.md` - If architecturally significant
- `README.md` - If user-facing
Create sample data:
```bash
uv run python tools/make_samples.py {provider} --datasets {dataset1} {dataset2} --out ./samples
```
Validate:
```bash
# Check sample data created
ls -la samples/{provider}/

# Run PK tests
pytest tests/test_{provider}_samples_pk.py -v

# Check metadata
cat samples/{provider}/{dataset}/dt=*/_meta.json | jq .
```
## Resources Provided
### references/
Provider implementation examples from the codebase:
- `example_registry.py` - Complete registry from nflverse with 10+ datasets
- `example_loader.py` - nflverse shim loader with Python/R fallback pattern
- `example_storage.py` - Storage helper with local and GCS support
Load these references when implementing a new provider to see proven patterns.
### assets/
Templates for creating new providers:
- `registry_template.py` - `registry.py` skeleton with placeholders
- `loader_template.py` - Loader function template with storage helpers
- `test_template.py` - Primary key test template with pytest
- `package_structure.md` - Complete directory structure and integration guide
Use these templates directly when generating provider code.
## Best Practices
### Registry Design
- **Accurate primary keys** - Test with real data to verify uniqueness
- **Descriptive names** - Use clear, consistent dataset names
- **Document grain** - Notes should explain row-level granularity
- **Consider joins** - Design PKs to enable joins with other datasets
### Loader Implementation
- **Handle failures gracefully** - Return empty DataFrames with metadata on errors
- **Include traceability** - Capture input parameters in metadata
- **Respect rate limits** - Add delays, implement exponential backoff
- **Validate before writing** - Check schema, row counts, nulls
- **Use storage helpers** - Don't reimplement Parquet writing
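As an example of the first bullet, a sketch of failing gracefully while preserving traceability (assuming Polars and the metadata shape from Step 4; `fetch_players_from_api` is a hypothetical fetch helper):

```python
from datetime import UTC, datetime

import polars as pl

try:
    records = fetch_players_from_api()  # hypothetical fetch helper
    df = pl.DataFrame(records)
    error = None
except Exception as exc:
    # Return an empty frame instead of crashing the pipeline run,
    # and record the failure in the metadata for traceability.
    df = pl.DataFrame()
    error = str(exc)

metadata = {
    "dataset": "players",
    "asof_datetime": datetime.now(UTC).isoformat(),
    "row_count": len(df),
    "error": error,  # illustrative field, not a fixed schema
}
```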
### Testing
- **Test with real samples** - Use actual provider data, not mocks
- **Cover all datasets** - Parametrize tests across the registry
- **Check metadata completeness** - Validate all required fields
- **Document expected failures** - If some rows are expected to fail PK tests, note why
### Integration
- **Update `make_samples.py`** - Enable easy sample generation
- **Document requirements** - Note authentication, dependencies, setup
- **Add to `CLAUDE.md`** - Help future developers understand the provider
- **Consider CI/CD** - Add to GitHub Actions if automated refresh is needed
## Common Patterns
### Authentication
Environment variables:
```python
import os

api_key = os.environ.get("{PROVIDER}_API_KEY")
if not api_key:
    raise ValueError("Set {PROVIDER}_API_KEY environment variable")
```
OAuth flow:
```python
from requests_oauthlib import OAuth2Session

oauth = OAuth2Session(client_id, token=token)
response = oauth.get(endpoint)
```
### Pagination
Offset-based:
```python
# fetch() stands in for the provider's actual API call
all_data = []
offset = 0
limit = 100
while True:
    response = fetch(offset=offset, limit=limit)
    data = response.json()
    all_data.extend(data)
    if len(data) < limit:
        break
    offset += limit
```
Cursor-based:
```python
all_data = []
cursor = None
while True:
    response = fetch(cursor=cursor)
    data = response.json()
    all_data.extend(data["results"])
    cursor = data.get("next_cursor")
    if not cursor:
        break
```
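Either variant can be wrapped in a generator, which keeps memory flat for large result sets by yielding records as pages arrive rather than accumulating them all. A minimal sketch for the offset-based case (same hypothetical `fetch` helper as above):

```python
from collections.abc import Iterator
from typing import Any


def paginate(fetch, limit: int = 100) -> Iterator[dict[str, Any]]:
    """Yield records one at a time across offset-based pages."""
    offset = 0
    while True:
        page = fetch(offset=offset, limit=limit).json()
        yield from page
        if len(page) < limit:
            break
        offset += limit
```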
### Rate Limiting
Simple delay:
```python
import time

for dataset in datasets:
    result = load_dataset()
    time.sleep(1)  # 1 second between requests
```
Exponential backoff:
```python
import time

from requests.exceptions import HTTPError

max_retries = 3
for attempt in range(max_retries):
    try:
        response = fetch()
        response.raise_for_status()
        break
    except HTTPError as e:
        if e.response.status_code == 429:  # Rate limited
            time.sleep(2 ** attempt)  # 1s, 2s, 4s
        else:
            raise
else:
    raise RuntimeError("Still rate limited after max retries")
```
## Output Format
When helping the user create a provider:
**After Step 2 (Registry Design):**

✅ Registry Designed: {provider}

Datasets defined:
- {dataset1}: {description} (PK: {pk_columns})
- {dataset2}: {description} (PK: {pk_columns})

Ready to create package structure (Step 3)?

**After Step 4 (Loader Implementation):**

✅ Loaders Implemented

Created loader functions:
- load_{dataset1}() - Fetches from {source}
- load_{dataset2}() - Fetches from {source}

All loaders use storage helpers and write metadata sidecars.

Ready to create tests (Step 5)?

**After Step 6 (Integration Complete):**

✅ Provider Integration Complete: {provider}

Created:
- Registry: `src/ingest/{provider}/registry.py` ({N} datasets)
- Loaders: `src/ingest/{provider}/loader.py`
- Tests: `tests/test_{provider}_samples_pk.py`
- Samples: `samples/{provider}/` ({N} datasets)

Integration:
- ✓ Added to `tools/make_samples.py`
- ✓ Updated documentation
- ✓ Primary key tests passing ({N}/{N})

To use:

```bash
# Generate samples
uv run python tools/make_samples.py {provider} --datasets all --out ./samples

# Run tests
pytest tests/test_{provider}_samples_pk.py -v
```

```python
# Use in production
from ingest.{provider}.loader import load_{dataset}

result = load_{dataset}(out_dir="gs://ff-analytics/raw/{provider}")
```
## Handling User Scenarios
### Scenario: User wants to add a specific API
User says: "Add integration for the ESPN Fantasy API"
Response:
- Begin Step 1 (Understand the Data Source)
- Ask clarifying questions about ESPN API
- Guide through all 6 steps to complete integration
### Scenario: User has API docs, needs implementation
User says: "I have the API docs for PFF, help me integrate it"
Response:
- Ask user to share key details (datasets, auth, PKs)
- Begin Step 2 (Design Registry)
- Proceed through implementation steps
### Scenario: User wants to fix an existing provider
User says: "The nflverse loader is missing a dataset"
Response:
- Read existing provider registry and loaders
- Add new dataset to registry (Step 2)
- Implement loader for new dataset (Step 4)
- Update tests and samples (Steps 5-6)
## Troubleshooting
### Issue: Primary key tests failing
- Review the data grain - are the PKs actually unique?
- Check for null values in PK columns
- Verify sample data represents the full population
- Consider composite keys if a single column is insufficient
### Issue: Storage helper fails with GCS
- Check the `GOOGLE_APPLICATION_CREDENTIALS` environment variable
- Verify GCS bucket permissions
- Test with a local path first, then GCS
- Review `references/example_storage.py` for patterns
### Issue: Loader returns empty data
- Check authentication credentials
- Verify API endpoint URLs
- Review rate limiting and retries
- Add debug logging to data fetching
### Issue: `make_samples.py` not finding the provider
- Ensure the provider package is in `src/ingest/{provider}/`
- Check `PYTHONPATH` includes `src/`
- Verify imports in `make_samples.py`
- Run from the repo root directory
## Integration with Other Skills
This skill works well with:
- **dbt-model-builder** - After ingestion, create staging models for the provider
- **data-quality-test-generator** - Add comprehensive tests beyond primary keys
- **data-architecture-spec1** - Ensure the provider follows SPEC-1 patterns