data-ingestion-builder

@zazu-22/ff_data_analytics
SKILL.md

name: data-ingestion-builder
description: Build new data ingestion providers following the FF Analytics registry pattern. This skill should be used when adding new data sources (APIs, files, databases) to the data pipeline. Guides through creating provider packages, registry mappings, loader functions, storage integration, primary key tests, and sampling tools following established patterns.

Data Ingestion Provider Builder

Create complete data ingestion providers for the Fantasy Football Analytics project following established patterns. This skill automates the process of adding new data sources with proper structure, metadata, testing, and integration.

When to Use This Skill

Use this skill proactively when:

  • Adding a new data source to the pipeline (API, file, database)
  • User mentions integrating data from a new provider
  • User asks about "adding a provider" or "new data source"
  • User references specific APIs or data sources to integrate (e.g., "add ESPN API", "integrate PFF data")
  • Expanding data coverage for analytics

Provider Integration Philosophy

The FF Analytics project follows these principles for data ingestion:

  1. Registry Pattern - Central mapping of datasets to loaders
  2. Storage Abstraction - Uniform Parquet output with metadata sidecars
  3. Metadata-First - Every load produces _meta.json with lineage
  4. Testable Samples - Primary key validation on sample data
  5. Local & Cloud - Same code works for local paths and gs:// URIs

Provider Building Workflow

Follow this six-step process to create a complete provider:

Step 1: Understand the Data Source

Before coding, gather information about the provider:

Ask clarifying questions:

  • What datasets does this provider offer?
  • What is the API/file format?
  • What are the authentication requirements?
  • What are the primary keys for each dataset?
  • Are there rate limits or ToS considerations?
  • What is the update frequency?

Research existing documentation:

  • API documentation URLs
  • Data schemas and field descriptions
  • Authentication methods
  • Rate limiting policies

Output: Clear understanding of:

  • Dataset names and descriptions
  • Primary keys for each dataset
  • Authentication approach
  • Any special considerations

Step 2: Design the Registry

Map datasets to loader functions and define metadata.

Use assets/registry_template.py as a starting point.

For each dataset, define:

  • name: Logical dataset name (lowercase, descriptive)
  • loader_function: Function name in loader.py
  • primary_keys: Tuple of columns that uniquely identify rows
  • description: Brief description of dataset contents
  • notes: Special considerations, dependencies, or caveats

Example registry design:

REGISTRY = {
    "players": DatasetSpec(
        name="players",
        loader_function="load_players",
        primary_keys=("player_id",),
        description="Player biographical and career data",
        notes="Updates daily. Includes active and retired players."
    ),
    "stats": DatasetSpec(
        name="stats",
        loader_function="load_stats",
        primary_keys=("player_id", "game_id", "stat_type"),
        description="Game-level player statistics",
        notes="Grain: one row per player per game per stat type"
    )
}
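
The example above assumes a DatasetSpec container defined next to the registry. A minimal sketch of such a dataclass, inferred only from the fields used here (the project's actual definition may differ), could look like:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetSpec:
    """Describes one dataset a provider offers (illustrative sketch only)."""

    name: str                      # logical dataset name, e.g. "players"
    loader_function: str           # loader name in loader.py, e.g. "load_players"
    primary_keys: tuple[str, ...]  # columns that uniquely identify a row at the stated grain
    description: str               # short human-readable summary
    notes: str = ""                # grain, caveats, update cadence
```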

Quality checks:

  • Primary keys are truly unique for the grain
  • Dataset names are descriptive and consistent
  • Loader function names follow the load_{dataset_name} pattern

Step 3: Create Provider Package Structure

Create the directory structure following the template.

See assets/package_structure.md for complete structure.

Create directories:

mkdir -p src/ingest/{provider}
mkdir -p tests
mkdir -p samples/{provider}

Create files:

  • src/ingest/{provider}/__init__.py (empty or with exports)
  • src/ingest/{provider}/registry.py (from Step 2)
  • src/ingest/{provider}/loader.py (will implement in Step 4)
  • tests/test_{provider}_samples_pk.py (will implement in Step 5)

Naming:

  • Provider name: lowercase, underscore-separated
  • Example: nflverse, espn_api, my_provider

Step 4: Implement Loader Functions

Create loader functions using storage helper pattern.

Use assets/loader_template.py as a starting point.

For each dataset in the registry:

  1. Create a loader function following this signature:

    def load_{dataset_name}(
        out_dir: str = "data/raw/{provider}",
        **kwargs
    ) -> dict[str, Any]:
    
  2. Implement data fetching:

    • API calls with proper authentication
    • File parsing (CSV, JSON, XML, etc.)
    • Database queries
    • Handle pagination, retries, error cases
  3. Convert to DataFrame:

    • Prefer Polars for performance
    • Pandas acceptable for compatibility
    • Ensure consistent column types
  4. Write with storage helper:

    from ingest.common.storage import write_parquet_any, write_text_sidecar
    
    # Write Parquet
    write_parquet_any(df, parquet_file)
    
    # Write metadata sidecar
    metadata = {
        "dataset": dataset_name,
        "asof_datetime": datetime.now(UTC).isoformat(),
        "loader_path": "src.ingest.{provider}.loader.load_{dataset}",
        "source_name": "{PROVIDER}",
        "source_version": version,
        "output_parquet": parquet_file,
        "row_count": len(df)
    }
    write_text_sidecar(json.dumps(metadata, indent=2), f"{partition_dir}/_meta.json")
    
  5. Return manifest:

    return {
        "dataset": dataset_name,
        "partition_dir": partition_dir,
        "parquet_file": parquet_file,
        "row_count": len(df),
        "metadata": metadata
    }
    

Reference examples:

  • references/example_loader.py - Complete nflverse loader
  • references/example_storage.py - Storage helper implementation

Common patterns:

  • Use datetime.now(UTC) for all timestamps
  • Generate UUIDs for file names: uuid.uuid4().hex[:8]
  • Partition by date: dt=YYYY-MM-DD
  • Handle both local paths and gs:// URIs uniformly
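
A minimal end-to-end loader that ties these fragments together is sketched below. It is a template, not the project's exact implementation: my_provider stands in for your package name and fetch_players() is a hypothetical placeholder for the provider-specific API, file, or database call.

```python
import json
import uuid
from datetime import UTC, datetime
from typing import Any

import polars as pl

from ingest.common.storage import write_parquet_any, write_text_sidecar


def load_players(out_dir: str = "data/raw/my_provider", **kwargs) -> dict[str, Any]:
    """Sketch: fetch, convert, write Parquet plus _meta.json, return a manifest."""
    records = fetch_players(**kwargs)  # placeholder for the provider-specific fetch
    df = pl.DataFrame(records)

    # Partition by load date and give the file a short unique suffix.
    asof = datetime.now(UTC)
    partition_dir = f"{out_dir}/players/dt={asof:%Y-%m-%d}"
    parquet_file = f"{partition_dir}/players_{uuid.uuid4().hex[:8]}.parquet"
    write_parquet_any(df, parquet_file)

    # Metadata sidecar recording lineage for this load.
    metadata = {
        "dataset": "players",
        "asof_datetime": asof.isoformat(),
        "loader_path": "src.ingest.my_provider.loader.load_players",
        "source_name": "my_provider",
        "output_parquet": parquet_file,
        "row_count": len(df),
    }
    write_text_sidecar(json.dumps(metadata, indent=2), f"{partition_dir}/_meta.json")

    return {
        "dataset": "players",
        "partition_dir": partition_dir,
        "parquet_file": parquet_file,
        "row_count": len(df),
        "metadata": metadata,
    }
```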

Step 5: Create Primary Key Tests

Validate sample data quality with automated tests.

Use assets/test_template.py as a starting point.

Test structure:

@pytest.mark.parametrize("dataset_name,spec", REGISTRY.items())
def test_{provider}_primary_keys(dataset_name, spec):
    # 1. Find sample files
    # 2. Read with Polars
    # 3. Check PK columns exist
    # 4. Check PK uniqueness
    # 5. Report duplicates if found

What to test:

  • Primary key columns exist in dataset
  • Primary key uniqueness (no duplicates)
  • Sample data is non-empty
  • Metadata sidecars exist and are valid
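
A fleshed-out test covering these checks, assuming samples live under samples/{provider}/{dataset}/dt=*/ as written by the loaders (my_provider is a placeholder package name), might look like:

```python
from pathlib import Path

import polars as pl
import pytest

from ingest.my_provider.registry import REGISTRY  # placeholder provider package

SAMPLES_ROOT = Path("samples/my_provider")


@pytest.mark.parametrize("dataset_name,spec", list(REGISTRY.items()))
def test_my_provider_primary_keys(dataset_name, spec):
    # 1. Find sample files for this dataset
    files = sorted(SAMPLES_ROOT.glob(f"{dataset_name}/dt=*/*.parquet"))
    if not files:
        pytest.skip(f"no samples for {dataset_name}")

    # 2. Read with Polars and confirm the sample is non-empty
    df = pl.concat([pl.read_parquet(f) for f in files])
    assert df.height > 0, f"{dataset_name}: sample data is empty"

    # 3. Check that all primary key columns exist
    missing = [c for c in spec.primary_keys if c not in df.columns]
    assert not missing, f"{dataset_name}: missing PK columns {missing}"

    # 4. Check PK uniqueness and report duplicates if found
    dupes = df.group_by(list(spec.primary_keys)).len().filter(pl.col("len") > 1)
    assert dupes.height == 0, f"{dataset_name}: duplicate primary keys\n{dupes}"
```

Metadata sidecar checks (for example, that each dt= partition contains a parseable _meta.json) follow the same pattern using json.loads.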

Run tests:

pytest tests/test_{provider}_samples_pk.py -v

Step 6: Integrate with Project Tooling

Connect the provider to existing workflows.

Update tools/make_samples.py:

Add provider-specific sampling logic:

# In make_samples.py argument parser
elif args.provider == "{provider}":
    from ingest.{provider}.loader import load_{dataset}

    # Provider-specific argument parsing
    datasets = args.datasets or ["default_dataset"]

    for dataset in datasets:
        result = load_{dataset}(
            out_dir=args.out,
            **provider_kwargs
        )
        print(f"✓ Sampled {dataset}: {result['row_count']} rows")

Update documentation:

  • src/ingest/CLAUDE.md - Add provider-specific notes
  • Root CLAUDE.md - If architecturally significant
  • README.md - If user-facing

Create sample data:

uv run python tools/make_samples.py {provider} --datasets {dataset1} {dataset2} --out ./samples

Validate:

# Check sample data created
ls -la samples/{provider}/

# Run PK tests
pytest tests/test_{provider}_samples_pk.py -v

# Check metadata
cat samples/{provider}/{dataset}/dt=*/_meta.json | jq .

Resources Provided

references/

Provider implementation examples from codebase:

  • example_registry.py - Complete registry from nflverse with 10+ datasets
  • example_loader.py - nflverse shim loader with Python/R fallback pattern
  • example_storage.py - Storage helper with local and GCS support

Load these references when implementing a new provider to see proven patterns.

assets/

Templates for creating new providers:

  • registry_template.py - Registry.py skeleton with placeholders
  • loader_template.py - Loader function template with storage helpers
  • test_template.py - Primary key test template with pytest
  • package_structure.md - Complete directory structure and integration guide

Use these templates directly when generating provider code.

Best Practices

Registry Design

  1. Accurate primary keys - Test with real data to verify uniqueness
  2. Descriptive names - Use clear, consistent dataset names
  3. Document grain - Notes should explain row-level granularity
  4. Consider joins - Design PKs to enable joins with other datasets

Loader Implementation

  1. Handle failures gracefully - Return empty DataFrames with metadata on errors (see the sketch after this list)
  2. Include traceability - Capture input parameters in metadata
  3. Respect rate limits - Add delays, implement exponential backoff
  4. Validate before writing - Check schema, row counts, nulls
  5. Use storage helpers - Don't reimplement Parquet writing
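
One way to realize the first point is to wrap the fetch step so a failure still yields an empty, correctly typed DataFrame plus an error string that the loader can record in _meta.json. A hedged sketch (the helper name and signature are illustrative, not part of the project):

```python
import polars as pl


def fetch_safely(fetch_fn, schema: dict) -> tuple[pl.DataFrame, str | None]:
    """Illustrative helper: return (data, error) so the loader can still write metadata."""
    try:
        return pl.DataFrame(fetch_fn()), None
    except Exception as exc:
        # An empty frame with the expected schema (e.g. {"player_id": pl.Utf8})
        # keeps the Parquet output consistent across successful and failed loads.
        return pl.DataFrame(schema=schema), str(exc)
```

The loader can then write row_count 0 and the error message into the sidecar so failed loads remain visible in lineage.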

Testing

  1. Test with real samples - Use actual provider data, not mocks
  2. Cover all datasets - Parametrize tests across registry
  3. Check metadata completeness - Validate all required fields
  4. Document expected failures - Note when some rows are expected to fail PK tests and why

Integration

  1. Update make_samples.py - Enable easy sample generation
  2. Document requirements - Note authentication, dependencies, setup
  3. Add to CLAUDE.md - Help future developers understand the provider
  4. Consider CI/CD - Add to GitHub Actions if automated refresh needed

Common Patterns

Authentication

Environment variables:

import os

api_key = os.environ.get("{PROVIDER}_API_KEY")
if not api_key:
    raise ValueError("Set {PROVIDER}_API_KEY environment variable")

OAuth flow:

from requests_oauthlib import OAuth2Session

oauth = OAuth2Session(client_id, token=token)
response = oauth.get(endpoint)

Pagination

Offset-based:

all_data = []
offset = 0
limit = 100

while True:
    response = fetch(offset=offset, limit=limit)
    data = response.json()
    all_data.extend(data)

    if len(data) < limit:
        break
    offset += limit

Cursor-based:

all_data = []
cursor = None

while True:
    response = fetch(cursor=cursor)
    data = response.json()
    all_data.extend(data["results"])

    cursor = data.get("next_cursor")
    if not cursor:
        break

Rate Limiting

Simple delay:

import time

for dataset in datasets:
    result = load_dataset()
    time.sleep(1)  # 1 second between requests

Exponential backoff:

import time
from requests.exceptions import HTTPError

max_retries = 3
for attempt in range(max_retries):
    try:
        response = fetch()  # provider-specific request call
        response.raise_for_status()
        break
    except HTTPError as e:
        if e.response.status_code == 429 and attempt < max_retries - 1:  # rate limited
            time.sleep(2 ** attempt)  # back off 1s, then 2s
        else:
            raise  # out of retries, or a non-rate-limit error

Output Format

When helping user create a provider:

  1. After Step 2 (Registry Design):

    ✅ Registry Designed: {provider}
    
    Datasets defined:
    - {dataset1}: {description} (PK: {pk_columns})
    - {dataset2}: {description} (PK: {pk_columns})
    
    Ready to create package structure (Step 3)?
    
  2. After Step 4 (Loader Implementation):

    ✅ Loaders Implemented
    
    Created loader functions:
    - load_{dataset1}() - Fetches from {source}
    - load_{dataset2}() - Fetches from {source}
    
    All loaders use storage helpers and write metadata sidecars.
    
    Ready to create tests (Step 5)?
    
  3. After Step 6 (Integration Complete):

    ✅ Provider Integration Complete: {provider}
    
    Created:
    - Registry: src/ingest/{provider}/registry.py ({N} datasets)
    - Loaders: src/ingest/{provider}/loader.py
    - Tests: tests/test_{provider}_samples_pk.py
    - Samples: samples/{provider}/ ({N} datasets)
    
    Integration:
    - ✓ Added to tools/make_samples.py
    - ✓ Updated documentation
    - ✓ Primary key tests passing ({N}/{N})
    
    To use:
    ```bash
    # Generate samples
    uv run python tools/make_samples.py {provider} --datasets all --out ./samples

    # Run tests
    pytest tests/test_{provider}_samples_pk.py -v
    ```

    ```python
    # Use in production
    from ingest.{provider}.loader import load_{dataset}
    result = load_{dataset}(out_dir="gs://ff-analytics/raw/{provider}")
    ```

Handling User Scenarios

Scenario: User wants to add a specific API

User says: "Add integration for the ESPN Fantasy API"

Response:

  1. Begin Step 1 (Understand the Data Source)
  2. Ask clarifying questions about ESPN API
  3. Guide through all 6 steps to complete integration

Scenario: User has API docs, needs implementation

User says: "I have the API docs for PFF, help me integrate it"

Response:

  1. Ask user to share key details (datasets, auth, PKs)
  2. Begin Step 2 (Design Registry)
  3. Proceed through implementation steps

Scenario: User wants to fix existing provider

User says: "The nflverse loader is missing a dataset"

Response:

  1. Read existing provider registry and loaders
  2. Add new dataset to registry (Step 2)
  3. Implement loader for new dataset (Step 4)
  4. Update tests and samples (Steps 5-6)

Troubleshooting

Issue: Primary key tests failing

  • Review data grain - are PKs actually unique?
  • Check for null values in PK columns
  • Verify sample data represents the full population
  • Consider composite keys if a single column is insufficient

Issue: Storage helper fails with GCS

  • Check GOOGLE_APPLICATION_CREDENTIALS environment variable
  • Verify GCS bucket permissions
  • Test with local path first, then GCS
  • Review references/example_storage.py for patterns

Issue: Loader returns empty data

  • Check authentication credentials
  • Verify API endpoint URLs
  • Review rate limiting and retries
  • Add debug logging to data fetching

Issue: make_samples.py not finding the provider

  • Ensure the provider package lives in src/ingest/{provider}/
  • Check PYTHONPATH includes src/
  • Verify imports in make_samples.py
  • Run from repo root directory

Integration with Other Skills

This skill works well with:

  • dbt-model-builder - After ingestion, create staging models for the provider
  • data-quality-test-generator - Add comprehensive tests beyond primary keys
  • data-architecture-spec1 - Ensure provider follows SPEC-1 patterns