| name | stci-dataops |
| description | Research and onboard new data sources for STCI, producing source profiles, collector stubs, and validation rules |
STCI Data Operations Skill
Purpose
This skill assists with researching, evaluating, and onboarding new data sources for the STCI (Standard Token Cost Index). Given a provider or aggregator, it produces structured outputs ready for implementation.
Capabilities
- Source Research: Analyze a pricing source (provider page, aggregator API) and extract key characteristics
- Legal Assessment: Review ToS, robots.txt, and rate limits for compliance
- Source Profile: Generate a complete source profile following the template
- Collector Stub: Produce a Python collector module for the source
- Validation Rules: Define source-specific validation and anomaly detection
Invocation
Use this skill when you need to:
- Add a new LLM provider to STCI
- Evaluate an aggregator API for data sourcing
- Create a source profile for a pricing endpoint
- Generate collector code for a new source
Workflow
Step 1: Source Identification
Given a source URL or name, gather:
- Source type (provider, aggregator, community)
- API endpoint (if available)
- Pricing page URL
- Data format (JSON, HTML, etc.)
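Where a structured record of these fields helps later steps, a minimal sketch is shown below; the dataclass and field names are illustrative, not part of the STCI schema.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical Step 1 output record; names are illustrative only.
@dataclass
class SourceIdentification:
    name: str
    source_type: str              # "provider" | "aggregator" | "community"
    api_endpoint: Optional[str]   # None if no API is available
    pricing_page_url: str
    data_format: str              # "JSON", "HTML", etc.

# Example: the OpenRouter source from the walkthrough later in this skill
openrouter = SourceIdentification(
    name="OpenRouter",
    source_type="aggregator",
    api_endpoint="https://openrouter.ai/api/v1/models",
    pricing_page_url="https://openrouter.ai",
    data_format="JSON",
)
```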
Step 2: Legal Review
Check and document (a robots.txt check sketch follows the decision matrix below):
- Whether robots.txt allows access to the target endpoints
- Whether the ToS permits automated data access
- Whether the ToS permits data redistribution
- Rate limits and API quotas
- Authentication requirements
Decision Matrix:
| Condition | Action |
|---|---|
| API explicitly public + no ToS restrictions | Proceed to T1/T2 source |
| API available but ToS unclear | Request legal review before production |
| Scraping required + ToS prohibits | Do NOT proceed - use alternative |
| Only manual collection is viable | Proceed as T4 source with caveats |
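For the robots.txt item in the checklist above, a minimal programmatic check using Python's standard-library `urllib.robotparser`; the URLs and user agent below are placeholders, not STCI-specific values.

```python
from urllib.robotparser import RobotFileParser

# Placeholder endpoint and user agent for illustration.
ROBOTS_URL = "https://example-provider.com/robots.txt"
TARGET_URL = "https://example-provider.com/api/v1/pricing"
USER_AGENT = "stci-collector"

parser = RobotFileParser()
parser.set_url(ROBOTS_URL)
parser.read()  # fetches and parses robots.txt

if parser.can_fetch(USER_AGENT, TARGET_URL):
    print("robots.txt allows access to the target endpoint")
else:
    print("robots.txt disallows access - do not collect automatically")
```

ToS and redistribution terms still require a human read; only the robots.txt portion can be checked mechanically.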
Step 3: Data Structure Analysis
For API sources:
```bash
# Fetch sample data
curl -s [API_URL] | head -100

# Analyze structure
# - Identify model ID field
# - Identify pricing fields (input, output, per-request)
# - Identify metadata fields (context window, etc.)
# - Note rate units (per-token, per-1K, per-1M)
```
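As a complement to the curl check, a short Python sketch that fetches a sample payload and prints the first record's keys so the model-ID and pricing fields can be identified. The endpoint is a placeholder, and the top-level "data" wrapper is an assumption to verify per source.

```python
import json
from urllib.request import urlopen

# Placeholder endpoint; substitute the source's actual API URL.
API_URL = "https://example-provider.com/api/v1/models"

with urlopen(API_URL, timeout=30) as resp:
    payload = json.load(resp)

# Many pricing APIs wrap records in a top-level "data" list; fall back to
# the raw payload if that assumption does not hold for this source.
records = payload.get("data", payload) if isinstance(payload, dict) else payload

# Inspect the first record (assumes the list is non-empty) to identify
# model ID, pricing, and metadata fields manually.
first = records[0]
print(sorted(first.keys()))
print(json.dumps(first, indent=2)[:1000])
```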
For HTML sources:
```bash
# Check page structure
# - Identify pricing table elements
# - Note update indicators (timestamps, version)
# - Assess scraping complexity
```
Step 4: Generate Source Profile
Output a completed source profile using template:
000-docs/009-DR-TMPL-source-profile.md
Required sections:
- Source Identification
- Legal & Compliance
- Data Availability
- Pricing Data Structure
- Collection Strategy
- Normalization Mapping
- Validation Rules
- Monitoring & Alerting
- Source Acceptance Checklist
Step 5: Generate Collector Stub
Produce a Python module following this pattern:
```python
# services/collector/sources/{source_id}.py
from datetime import date
from typing import List

from .base import BaseSource


class {SourceName}Source(BaseSource):
    """
    {Source description}

    See: {Source URL}
    """

    API_URL = "{api_url}"

    @property
    def source_id(self) -> str:
        return "{source_id}"

    @property
    def source_tier(self) -> str:
        return "{tier}"  # T1, T2, T3, or T4

    def fetch(self, target_date: date) -> List[dict]:
        # Implementation
        pass
```
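For a JSON API source, `fetch` usually reduces to an HTTP GET plus a per-record mapping. The sketch below assumes a top-level "data" list and uses `requests`; the class name, endpoint, and observation field names are illustrative and should be aligned with the actual BaseSource contract.

```python
from datetime import date
from typing import List

import requests  # assumed HTTP client; not mandated by BaseSource

from .base import BaseSource


class ExampleJsonSource(BaseSource):
    """Illustrative JSON-API collector; endpoint and field names are assumptions."""

    API_URL = "https://example-provider.com/api/v1/models"

    @property
    def source_id(self) -> str:
        return "example_json"

    @property
    def source_tier(self) -> str:
        return "T2"

    def fetch(self, target_date: date) -> List[dict]:
        resp = requests.get(self.API_URL, timeout=30)
        resp.raise_for_status()
        observations = []
        for item in resp.json().get("data", []):
            observations.append(
                {
                    "observation_date": target_date.isoformat(),
                    "source_id": self.source_id,
                    "model_id": item["id"],
                    # Raw per-token rates; normalized to USD per 1M downstream.
                    "input_rate": item["pricing"]["prompt"],
                    "output_rate": item["pricing"]["completion"],
                }
            )
        return observations
```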
Step 6: Define Validation Rules
Specify source-specific rules:
```yaml
validation:
  required_fields:
    - model_id
    - input_rate
    - output_rate
  rate_bounds:
    input_max: 100.0    # USD per 1M
    output_max: 500.0
  model_id_pattern: "^[a-z0-9-]+/[a-z0-9.-]+$"
  cross_reference:
    enabled: true
    tolerance: 0.10     # 10% tolerance
    reference_source: "openrouter"
```
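One way these rules can be applied in the validation layer, assuming each observation is a flat dict and the config above has been loaded into a dict; the function name and error messages are illustrative.

```python
import re

def validate_observation(obs: dict, rules: dict) -> list[str]:
    """Return a list of human-readable violations; an empty list means valid."""
    errors = []

    for field in rules["required_fields"]:
        if field not in obs or obs[field] in (None, ""):
            errors.append(f"missing required field: {field}")

    bounds = rules["rate_bounds"]
    if obs.get("input_rate", 0) > bounds["input_max"]:
        errors.append("input_rate above configured maximum")
    if obs.get("output_rate", 0) > bounds["output_max"]:
        errors.append("output_rate above configured maximum")

    if not re.match(rules["model_id_pattern"], str(obs.get("model_id", ""))):
        errors.append("model_id does not match expected pattern")

    return errors

# Example: an observation with an out-of-bounds output rate
rules = {
    "required_fields": ["model_id", "input_rate", "output_rate"],
    "rate_bounds": {"input_max": 100.0, "output_max": 500.0},
    "model_id_pattern": r"^[a-z0-9-]+/[a-z0-9.-]+$",
}
print(validate_observation(
    {"model_id": "openai/gpt-4o", "input_rate": 2.5, "output_rate": 10_000.0},
    rules,
))
```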
Step 7: Generate Test Fixtures
Create test data for the source:
```jsonc
// data/fixtures/{source_id}_sample.json
[
  {
    "observation_id": "obs-2026-01-01-{source_id}-{model}",
    "provider": "{provider}",
    ...
  }
]
```
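A small pytest-style sketch for exercising such a fixture; the file path, fixture name, and asserted fields are placeholders to adapt to the actual repo layout.

```python
# tests/test_source_fixture.py  (path is illustrative)
import json
from pathlib import Path

FIXTURE = Path("data/fixtures") / "example_source_sample.json"  # placeholder name

def test_fixture_observations_have_core_fields():
    observations = json.loads(FIXTURE.read_text())
    assert observations, "fixture should contain at least one observation"
    for obs in observations:
        # Core fields every STCI observation record is expected to carry.
        assert obs["observation_id"].startswith("obs-")
        assert obs["provider"]
```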
Output Checklist
After running this skill, you should have:
- Source profile document (`000-docs/0XX-DR-REFF-{source_id}-profile.md`)
- Collector module (`services/collector/sources/{source_id}.py`)
- Test fixtures (`data/fixtures/{source_id}_sample.json`)
- Validation config update (`data/fixtures/methodology.yaml`)
- Beads task for implementation tracking
Example: Onboarding OpenRouter
Input
Source: OpenRouter
URL: https://openrouter.ai
API: https://openrouter.ai/api/v1/models
Research Output
Legal Assessment:
- robots.txt: Allows /api/
- ToS: Public API, no explicit redistribution restriction found
- Rate limits: Standard API limits apply
- Auth: None required for models endpoint
- Risk: LOW
Data Structure:
```json
{
  "data": [
    {
      "id": "openai/gpt-4o",
      "name": "GPT-4o",
      "pricing": {
        "prompt": "0.0000025",
        "completion": "0.00001"
      }
    }
  ]
}
```
Normalization:
- `id` → `model_id`
- `pricing.prompt` × 1M → `input_rate_usd_per_1m`
- `pricing.completion` × 1M → `output_rate_usd_per_1m`
Source Tier: T1 (public API, high confidence)
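The per-token string prices multiply out to per-1M-token rates, as in this minimal normalization sketch for the record above (function name and output keys are illustrative):

```python
from decimal import Decimal

def normalize_openrouter(record: dict) -> dict:
    """Map an OpenRouter model record to per-1M-token USD rates."""
    per_million = Decimal(1_000_000)
    return {
        "model_id": record["id"],
        "input_rate_usd_per_1m": float(Decimal(record["pricing"]["prompt"]) * per_million),
        "output_rate_usd_per_1m": float(Decimal(record["pricing"]["completion"]) * per_million),
    }

record = {
    "id": "openai/gpt-4o",
    "name": "GPT-4o",
    "pricing": {"prompt": "0.0000025", "completion": "0.00001"},
}
print(normalize_openrouter(record))
# {'model_id': 'openai/gpt-4o', 'input_rate_usd_per_1m': 2.5, 'output_rate_usd_per_1m': 10.0}
```

Decimal is used to avoid float rounding on the string prices before scaling to per-1M rates.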
Related Documents
- 007-RL-RSRC-data-strategy-research.md
- 008-PM-RISK-legal-source-risks.md
- 009-DR-TMPL-source-profile.md
- 010-DR-SOPS-data-ops-practices.md
STCI Data Operations Skill Version: 1.0.0