| name | scraping-data-pipeline |
| description | Use this skill when scraping UFC fighter data from UFCStats.com, validating scraped data, loading data into the database, or running the complete scraping pipeline. Handles fighters list, fighter details, events, and fight history. Includes data validation, error recovery, cache invalidation, and progress reporting. |
You are an expert at orchestrating the UFC Pokedex scraping pipeline, which involves multiple steps from data collection to database loading.
## Pipeline Overview

The scraping pipeline follows this flow:

```
UFCStats.com → Scrapy Spiders → JSON/JSONL files → Validation → Database → Cache Invalidation
```
## When to Use This Skill
Invoke this skill when the user wants to:
- Scrape fighter data (list or details)
- Load scraped data into the database
- Validate scraped JSON/JSONL files
- Run the complete scraping pipeline
- Handle missing or failed scrapes
- Update fighter data from UFC Stats
## Available Operations

### 1. Scraping Operations

#### Scrape Fighter List

Scrapes all fighter URLs from the UFCStats.com alphabetical listing.

**Command:**

```bash
make scraper
```

**Output:** `data/processed/fighters_list.jsonl`
**What it does:**

- Crawls http://ufcstats.com/statistics/fighters alphabetically
- Extracts fighter IDs and basic info
- Saves to JSONL format (one fighter per line)
- Takes ~5-10 minutes for the full list
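Because the output is JSONL, each line parses on its own. A minimal sketch for loading and counting the list in Python (the path comes from this pipeline; the record fields are whatever the spider emits):

```python
import json

# Read the scraped fighter list; each JSONL line is one fighter record.
with open("data/processed/fighters_list.jsonl") as f:
    fighters = [json.loads(line) for line in f if line.strip()]

print(f"{len(fighters)} fighters in list")
```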
#### Scrape Fighter Details

Scrapes detailed information for specific fighters.

**Command options:**

```bash
# Sample scrape (single fighter for testing)
make scrape-sample

# All fighters from the list
make scraper-details

# Specific fighters by ID
.venv/bin/scrapy crawl fighter_detail -a fighter_ids=id1,id2,id3

# Specific fighters by URL
.venv/bin/scrapy crawl fighter_detail -a fighter_urls=url1,url2
```

**Output:** `data/processed/fighters/{id}.json` (individual files)
**What it scrapes:**

- Personal info (name, nickname, DOB, height, weight, reach, stance)
- Complete fight history with results
- Career statistics
- Current record (W-L-D)
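To see what a detail record actually contains, inspect one file before writing code against it. This sketch makes no assumptions about field names; it just prints whatever top-level keys the spider produced:

```python
import json
from pathlib import Path

# Grab one scraped detail file and show its top-level structure.
sample = next(Path("data/processed/fighters").glob("*.json"))
fighter = json.loads(sample.read_text())
print(sample.name, "->", sorted(fighter.keys()))
```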
### 2. Data Loading Operations

#### Load Sample Data (SQLite-safe)

Loads the first 8 fighters from fixtures for quick testing.

**Command:**

```bash
make api:seed
```

**Use when:**

- Testing on SQLite
- You need quick test data
- Developing new features
#### Load All Scraped Data

Loads the complete scraped dataset into the database.

**Command:**

```bash
make load-data
```

**Use when:**

- You have the full scraped dataset
- Running PostgreSQL
- You need production data

**Safety note:** Blocked on SQLite by default (10K+ fighters). Override with:

```bash
ALLOW_SQLITE_PROD_SEED=1 make api:seed-full
```
#### Load Sample of Scraped Data

Loads only the first N fighters for testing.

**Command:**

```bash
make load-data-sample
```

**Use when:**

- Testing with realistic data
- You need more than 8 fighters but not the full dataset
- Validating data-loading logic
### 3. Data Validation

Before loading, validate the scraped data:

**Steps:**

1. Check that the files exist:

   ```bash
   ls -lh data/processed/fighters_list.jsonl
   ls -lh data/processed/fighters/*.json | wc -l
   ```

2. Validate JSONL structure:

   ```bash
   head -5 data/processed/fighters_list.jsonl | jq '.'
   ```

3. Check for parsing errors:

   ```bash
   grep -i "error\|exception" data/processed/fighters_list.jsonl
   ```

4. Count records:

   ```bash
   wc -l data/processed/fighters_list.jsonl
   ```
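The same checks can be rolled into one script that pinpoints exactly which lines fail to parse. A minimal sketch, assuming only that the file is line-delimited JSON as described above:

```python
import json

# Validate every line of the fighter list and report parse failures.
errors = 0
with open("data/processed/fighters_list.jsonl") as f:
    for lineno, line in enumerate(f, start=1):
        if not line.strip():
            continue
        try:
            json.loads(line)
        except json.JSONDecodeError as exc:
            errors += 1
            print(f"line {lineno}: {exc}")

print("OK" if errors == 0 else f"{errors} bad line(s)")
```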
## Complete Pipeline Workflows

### Workflow 1: Full Scrape and Load (PostgreSQL)

Use this for a complete data refresh on PostgreSQL.

**Steps:**

```bash
# 1. Ensure PostgreSQL is running
docker-compose up -d

# 2. Scrape fighter list (~5-10 min)
make scraper

# 3. Validate fighter list
wc -l data/processed/fighters_list.jsonl
head -3 data/processed/fighters_list.jsonl | jq '.'

# 4. Scrape fighter details (~30-60 min for all)
make scraper-details

# 5. Validate detail files
ls data/processed/fighters/*.json | wc -l

# 6. Load into database
make load-data

# 7. Verify in database
PGPASSWORD=ufc_pokedex psql -h localhost -p 5432 -U ufc_pokedex -d ufc_pokedex -c "SELECT COUNT(*) FROM fighters;"

# 8. Restart backend to invalidate cache
pkill -f uvicorn
make api
```

**Expected duration:** 40-90 minutes total
### Workflow 2: Sample Scrape and Load (Quick Test)

Use this for quick testing or development.

**Steps:**

```bash
# 1. Start with SQLite (no Docker needed); export so make picks it up
export USE_SQLITE=1

# 2. Scrape sample fighter
make scrape-sample

# 3. Seed with fixtures
make api:seed

# 4. Start backend
make api:dev

# 5. Verify
curl http://localhost:8000/fighters/ | jq '.fighters | length'
```

**Expected duration:** 2-3 minutes total
### Workflow 3: Update Specific Fighters

Use this to refresh data for specific fighters.

**Steps:**

```bash
# 1. Get fighter IDs that need updating
#    (from the database or fighters_list.jsonl)

# 2. Scrape specific fighters
.venv/bin/scrapy crawl fighter_detail -a fighter_ids=id1,id2,id3

# 3. Reload just those fighters
#    (currently requires a full reload; see Limitations below)
make load-data

# 4. Restart backend
pkill -f uvicorn && make api
```
### Workflow 4: Handle Missing/Failed Scrapes

If some fighters failed to scrape:

**Steps:**

```bash
# 1. Find fighters with missing detail files:
#    compare the fighters_list.jsonl count vs. the fighters/*.json count
TOTAL=$(wc -l < data/processed/fighters_list.jsonl)
SCRAPED=$(ls data/processed/fighters/*.json 2>/dev/null | wc -l)
echo "Total: $TOTAL, Scraped: $SCRAPED, Missing: $((TOTAL - SCRAPED))"

# 2. Identify missing IDs (see the sketch after this block)

# 3. Re-scrape missing fighters
.venv/bin/scrapy crawl fighter_detail -a fighter_ids=missing_id1,missing_id2

# 4. Reload data
make load-data
```
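For step 2, a sketch of that quick script. It assumes each JSONL line carries an `id` field that matches the `{id}.json` detail filename; verify against your actual schema before relying on it:

```python
import json
from pathlib import Path

# IDs present in the scraped fighter list (assumes an "id" field).
listed = set()
with open("data/processed/fighters_list.jsonl") as f:
    for line in f:
        if line.strip():
            listed.add(json.loads(line)["id"])

# IDs that already have a detail file (the filename stem is the ID).
scraped = {p.stem for p in Path("data/processed/fighters").glob("*.json")}

missing = sorted(listed - scraped)
print(f"{len(missing)} fighters missing detail files")
print(",".join(missing))  # paste into -a fighter_ids=...
```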
## Cache Management

After loading data, invalidate the Redis cache (if using Redis):

**Commands:**

```bash
# Check Redis connection
redis-cli ping

# Clear all fighter-related cache keys
redis-cli KEYS "fighters:*" | xargs redis-cli DEL

# Or clear the entire cache
redis-cli FLUSHDB
```

**Note:** The backend gracefully degrades if Redis is unavailable, so cache clearing is optional.
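`KEYS` blocks Redis while it walks the entire keyspace, so on a larger cache a non-blocking `SCAN` is safer. A minimal sketch using the redis-py package (an assumed dependency; host and port are the local defaults):

```python
import redis

# Delete fighter cache keys incrementally with SCAN instead of KEYS.
r = redis.Redis(host="localhost", port=6379)
deleted = 0
for key in r.scan_iter(match="fighters:*"):
    r.delete(key)
    deleted += 1
print(f"deleted {deleted} keys")
```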
## Error Handling

### Common Issues and Solutions

**Issue:** "No such file or directory: data/processed/fighters_list.jsonl"

**Solution:** Run `make scraper` first to create the fighter list.

**Issue:** Scraper fails with connection errors

**Solution:**

- Check your internet connection
- Verify UFCStats.com is accessible
- Reduce concurrent requests in `scraper/settings.py`
- Add delays: `SCRAPER_DELAY_SECONDS=2.0`

**Issue:** Database constraint violations during load

**Solution:**

- Drop and recreate the database: `make db-reset`
- Check for duplicate IDs in the scraped data
- Validate JSON structure

**Issue:** "Blocked on SQLite" when loading the full dataset

**Solution:**

- Use PostgreSQL for production data: `docker-compose up -d`
- Or override the safety check: `ALLOW_SQLITE_PROD_SEED=1 make api:seed-full`

**Issue:** Partial scrapes (some fighters missing)

**Solution:**

- Use Workflow 4 above to identify and re-scrape missing fighters
- Check the scraper logs for errors
- Verify the HTML structure hasn't changed on UFCStats.com
## Data Validation Checklist

Before loading, verify:

- [ ] `fighters_list.jsonl` exists and has the expected count (~1000-2000 fighters)
- [ ] Each line in the JSONL is valid JSON
- [ ] Fighter detail JSON files exist in `data/processed/fighters/`
- [ ] A sample fighter JSON has the required fields: `id`, `name`, `record` (see the sketch after this list)
- [ ] No parsing errors in the JSON files
- [ ] The database is running (PostgreSQL) or SQLite mode is enabled
- [ ] The database schema is up to date: `make db-upgrade`
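The required-fields check can be automated across every detail file. A sketch using the three fields the checklist names (any further required fields are project-specific):

```python
import json
from pathlib import Path

REQUIRED = {"id", "name", "record"}  # fields named in the checklist

# Flag any detail file missing one of the required fields.
for path in Path("data/processed/fighters").glob("*.json"):
    data = json.loads(path.read_text())
    missing = REQUIRED - data.keys()
    if missing:
        print(f"{path.name}: missing {sorted(missing)}")
```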
## Progress Monitoring

Monitor scraping progress:

```bash
# Watch fighter detail files being created
watch -n 5 'ls data/processed/fighters/*.json | wc -l'

# Check scrapy logs
tail -f scrapy.log
```

Monitor database loading:

```bash
# Check the fighter count in the database
PGPASSWORD=ufc_pokedex psql -h localhost -p 5432 -U ufc_pokedex -d ufc_pokedex -c "SELECT COUNT(*) FROM fighters;"

# Check recent inserts
PGPASSWORD=ufc_pokedex psql -h localhost -p 5432 -U ufc_pokedex -d ufc_pokedex -c "SELECT name, record, created_at FROM fighters ORDER BY created_at DESC LIMIT 10;"
```
## Best Practices

- **Always validate before loading** - Check JSONL structure and counts
- **Use sample loads for testing** - Don't load 10K fighters into SQLite
- **Monitor scraping progress** - Scrapy can fail silently on some fighters
- **Respect rate limits** - Don't hammer UFCStats.com
- **Clear the cache after loading** - Ensure the API serves fresh data
- **Back up before a full reload** - Loading drops and recreates database tables
- **Test with a sample first** - Validate the pipeline with `make scrape-sample`
## Limitations

- **Incremental updates not supported** - Must reload all data (no upsert logic yet)
- **No automatic scheduling** - Scrapes must be triggered manually
- **No fighter image scraping** - Images are handled separately (see the `managing-fighter-images` skill)
- **SQLite safety checks** - The full dataset is blocked on SQLite (a design choice)
- **No event scraping yet** - The events spider exists but is not integrated
## Quick Reference

```bash
# Full pipeline (PostgreSQL)
make scraper && make scraper-details && make load-data

# Quick test (SQLite)
make scrape-sample && make api:seed && make api:dev

# Validate data
wc -l data/processed/fighters_list.jsonl
ls data/processed/fighters/*.json | wc -l

# Clear cache
redis-cli FLUSHDB

# Check database
PGPASSWORD=ufc_pokedex psql -h localhost -U ufc_pokedex -d ufc_pokedex -c "SELECT COUNT(*) FROM fighters;"
```