name	inspect-raw-data
description	Inspect raw data for debugging - MongoDB records and source files. Use before fixing issues to understand root cause. Can inspect parsed records, raw source files (CSV, TXT, PDF, JSON), compare them, and produce health assessments.

Overview

Two inspection targets:

MongoDB - Parsed records in the county_data.parcels collection
Source Files - Raw data in the data/<county>/<year>/ directory

Workflow

1. Check Source Data Availability

Before diagnosing parsing issues, verify source files exist:

ls -la data/<county>/<year>/

Result	Diagnosis	Action
Directory doesn't exist	Data not downloaded	Download data first
Directory empty	Download failed/incomplete	Re-download
Has CSV/TXT/PDF/JSON files	Source available	Continue to parsing check
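The table can also be applied from Python when checking many counties at once. This is a minimal sketch, assuming the data/<county>/<year>/ layout described in this skill; the function name `check_source_dir` and the fourth "unrecognized files" outcome are illustrative and not part of the skill.

```python
from pathlib import Path

# Extensions treated as source files; mirrors the formats listed in this skill.
SOURCE_EXTENSIONS = {".csv", ".txt", ".pdf", ".json"}


def check_source_dir(path):
    """Return (diagnosis, action) for a data/<county>/<year>/ directory."""
    directory = Path(path)
    if not directory.is_dir():
        return ("Data not downloaded", "Download data first")
    # Search recursively so files under snapshots/ are counted as well.
    files = [p for p in directory.rglob("*") if p.is_file()]
    if not files:
        return ("Download failed/incomplete", "Re-download")
    if any(p.suffix.lower() in SOURCE_EXTENSIONS for p in files):
        return ("Source available", "Continue to parsing check")
    # Hypothetical extra case: files exist but none has a known extension.
    return ("Unrecognized files only", "Inspect manually")
```

Extension matching is case-insensitive here, so both `ROLL.TXT` and `roll.txt` count as source files.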

2. Check MongoDB Records

Query for county/year data and compute metrics:

Total record count
% with missing owners
% with missing/malformed addresses
% with zero valuations

3. Compare Source vs Parsed

If issues are found, compare values in the source file with the parsed MongoDB values to determine whether the parser introduced the error or the source data itself is bad.

4. Summarize Health

Produce an assessment: healthy, suspicious, or broken.
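This skill does not define numeric cut-offs for the three statuses. The sketch below shows one way to make the decision repeatable; the function name `assess_health` and both thresholds (95% and 50%) are placeholder assumptions to be tuned per project.

```python
def assess_health(total, pct_with_owner, pct_with_valuation,
                  suspicious_below=95.0, broken_below=50.0):
    """Map coverage percentages to healthy / suspicious / broken.

    The default thresholds are placeholder assumptions, not values from this skill.
    """
    if total == 0:
        # No parsed records at all is treated as broken.
        return "broken"
    worst = min(pct_with_owner, pct_with_valuation)
    if worst < broken_below:
        return "broken"
    if worst < suspicious_below:
        return "suspicious"
    return "healthy"
```

Using the worst of the two coverage figures means a single badly parsed field is enough to flag the county.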

Source File Inspection

Directory Structure

data/<county>/<year>/
├── *.csv              # CSV exports
├── *.TXT              # Fixed-width or tab-delimited
├── *.pdf              # PDF appraisal rolls
├── *.json             # JSON exports
└── snapshots/         # Web scrape snapshots (if applicable)

Finding Source Files

# List all files for a county/year
ls -la data/<county>/<year>/

# Find all source files
find data/<county>/<year>/ -type f \( -name "*.csv" -o -name "*.TXT" -o -name "*.json" -o -name "*.pdf" \)

# Check file size
du -h data/<county>/<year>/*

Inspecting Source Files

# CSV - view headers and first rows
head -5 data/<county>/<year>/*.csv

# TXT - view structure (often fixed-width)
head -20 data/<county>/<year>/*.TXT

# Count lines
wc -l data/<county>/<year>/*

# Find specific record in source
grep "R000001" data/<county>/<year>/*.csv
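Because TXT files may be fixed-width or tab-delimited, it can help to guess the layout before reading further. The following is a rough heuristic sketch, not the parser's own detection logic; the function name `guess_layout` is hypothetical.

```python
def guess_layout(lines):
    """Guess whether sample lines are tab-delimited or fixed-width (heuristic)."""
    # Drop blank lines and line endings, but keep trailing spaces:
    # they are significant padding in fixed-width files.
    rows = [line.rstrip("\r\n") for line in lines if line.strip()]
    if not rows:
        return "empty"
    tab_counts = {row.count("\t") for row in rows}
    if len(tab_counts) == 1 and tab_counts != {0}:
        # Every row has the same non-zero number of tabs.
        return "tab-delimited"
    if len(rows) > 1 and len({len(row) for row in rows}) == 1:
        # Every row has exactly the same length.
        return "fixed-width"
    return "unknown"
```

Pass it the first 20 or so lines of a file, for example via `itertools.islice` over an open file handle. A result of "unknown" means the file needs a manual look.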

Parser Type Detection

# Look for county-specific parser
ls county_parser/parsers/*<county>*.py

# Check if county has a spec file (appraisal parser)
ls county_parser/parsers/appraisal_info_parser/<county>*.json

# Check if in PDF registry
grep "<county>" county_parser/parsers/pdf_parser_registry.py

Parser Type	How to identify	Target File
`explicit`	Has `<county>_county_parser.py`	`county_parser/parsers/<county>_county_parser.py`
`appraisal`	Has spec in `appraisal_info_parser/`	`county_parser/parsers/appraisal_info_parser/base.py`
`pdf`	Listed in `PDF_ONLY_COUNTIES`	`county_parser/parsers/pdf_parser_base.py`
`csv`	Has `.csv` source files	`county_parser/parsers/csv_parser_base.py`
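The checks in the table can be combined into one helper. This is a sketch under assumptions: the function name `detect_parser_type` is hypothetical, the checks run in the table's row order (that precedence is assumed, not confirmed), and the registry check is a plain substring search like the grep above rather than a real lookup in `PDF_ONLY_COUNTIES`.

```python
from pathlib import Path


def detect_parser_type(repo_root, county, data_dir):
    """Return 'explicit', 'appraisal', 'pdf', 'csv', or None for a county.

    Checks are tried in the table's row order; that precedence is an assumption.
    """
    parsers = Path(repo_root) / "county_parser" / "parsers"
    if (parsers / f"{county}_county_parser.py").is_file():
        return "explicit"
    spec_dir = parsers / "appraisal_info_parser"
    if spec_dir.is_dir() and any(spec_dir.glob(f"{county}*.json")):
        return "appraisal"
    registry = parsers / "pdf_parser_registry.py"
    # Substring search only; it does not parse PDF_ONLY_COUNTIES.
    if registry.is_file() and county in registry.read_text(encoding="utf-8"):
        return "pdf"
    source_dir = Path(data_dir)
    if source_dir.is_dir() and any(
        p.suffix.lower() == ".csv" for p in source_dir.iterdir()
    ):
        return "csv"
    return None
```

A return value of `None` means none of the four checks matched, which usually points back to step 1 of the workflow.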

MongoDB Inspection

Sample Records

from pymongo import MongoClient

client = MongoClient()
db = client["county_data"]

# Sample random records
records = list(db.parcels.aggregate([
    {"$match": {"county": "archer"}},
    {"$sample": {"size": 5}}
]))

# Get a specific record by county_id
record = db.parcels.find_one({"county": "archer", "county_id": "R000001"})

# View specific fields
records = list(db.parcels.find(
    {"county": "archer"},
    {"county_id": 1, "property_address": 1, "mailing_address": 1, "tax_year": 1}
).limit(10))

Tax Year Breakdown

pipeline = [
    {"$match": {"county": "archer"}},
    {"$group": {"_id": "$tax_year", "count": {"$sum": 1}}},
    {"$sort": {"_id": -1}}
]
list(db.parcels.aggregate(pipeline))

Find Bad Records

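import re

# Accepts 5-digit ZIPs and ZIP+4 (12345 or 12345-6789)
ZIP_PATTERN = re.compile(r"\d{5}(-\d{4})?")

# Fetch a sample of records that have a mailing zip code.
# limit(100) caps how many records are checked, not how many bad ones are found.
records = list(db.parcels.find(
    {"county": "archer", "mailing_address.zip_code": {"$exists": True}},
    {"county_id": 1, "mailing_address.zip_code": 1}
).limit(100))

# Print every sampled record whose zip code does not match the whole pattern
for r in records:
    zip_code = (r.get("mailing_address") or {}).get("zip_code")
    if zip_code and not ZIP_PATTERN.fullmatch(str(zip_code)):
        print(f"{r.get('county_id')}: {zip_code}")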
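When only a few records are malformed, a 100-record sample may miss them. One alternative is to let MongoDB apply the pattern, so the limit applies to bad records rather than to all records. This is a sketch, assuming a server version that accepts `$not` with `$regex` (4.0.7 or later); the helper name `malformed_zip_filter` is hypothetical.

```python
# Same pattern as the client-side check, as a string for MongoDB's $regex.
ZIP_REGEX = r"^\d{5}(-\d{4})?$"


def malformed_zip_filter(county):
    """Build a find() filter for records whose mailing zip fails ZIP_REGEX."""
    return {
        "county": county,
        "mailing_address.zip_code": {
            "$exists": True,
            "$nin": [None, ""],            # skip missing values; those are a separate issue
            "$not": {"$regex": ZIP_REGEX}  # keep only values the pattern rejects
        },
    }


# Usage (requires a live connection):
# db.parcels.find(malformed_zip_filter("archer"),
#                 {"county_id": 1, "mailing_address.zip_code": 1}).limit(100)
```

Since `$regex` only matches strings, zip codes stored as numbers would most likely be returned by this filter as well, which may itself be worth knowing.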

Find Incomplete Records

# Property city missing but mailing city exists
records = list(db.parcels.find({
    "county": "archer",
    "mailing_address.city": {"$nin": [None, ""]},
    "$or": [
        {"property_address": None},
        {"property_address.city": None},
        {"property_address.city": ""}
    ]
}, {"county_id": 1, "property_address": 1, "mailing_address": 1}).limit(10))

Data Quality Metrics

# Count owners and valuations
pipeline = [
    {"$match": {"county": "<county>", "tax_year": 2024}},
    {"$facet": {
        "total": [{"$count": "count"}],
        "with_owner": [
            {"$addFields": {"first_owner": {"$arrayElemAt": ["$owners", 0]}}},
            {"$match": {"first_owner.name": {"$exists": True, "$nin": [None, ""]}}},
            {"$count": "count"}
        ],
        "with_valuation": [
            {"$match": {"$or": [
                {"valuation.market_value": {"$gt": 0}},
                {"valuation.assessed_value": {"$gt": 0}}
            ]}},
            {"$count": "count"}
        ]
    }}
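]

# $facet returns a single document; each facet is a list that is empty when nothing matched
result = list(db.parcels.aggregate(pipeline))[0]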
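The facet output holds raw counts, while the workflow asks for percentages. The sketch below converts one into the other; the function name `facet_to_percentages` and the output keys are illustrative. It takes the single document that the `$facet` pipeline returns.

```python
def facet_to_percentages(facet_doc):
    """Convert the single $facet result document into coverage percentages."""
    def count(name):
        # $count emits no document when nothing matched, so the list may be empty.
        rows = facet_doc.get(name) or []
        return rows[0]["count"] if rows else 0

    total = count("total")
    if total == 0:
        # Avoid dividing by zero when the county/year has no records.
        return {"total": 0, "pct_with_owner": 0.0, "pct_with_valuation": 0.0}
    return {
        "total": total,
        "pct_with_owner": round(100 * count("with_owner") / total, 1),
        "pct_with_valuation": round(100 * count("with_valuation") / total, 1),
    }
```

The missing-owner and zero-valuation percentages from the workflow are then simply 100 minus the corresponding value.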

Useful Field Projections

Scenario	Projection
Address issues	`{"county_id": 1, "property_address": 1, "mailing_address": 1}`
Owner issues	`{"county_id": 1, "owners": 1, "january_1_owner": 1}`
Value issues	`{"county_id": 1, "valuation": 1}`
Full record	Omit the projection argument

Common Inspection Patterns

Pattern 1: Why is this field wrong?

Sample records with issues from MongoDB
Note the county_id of a bad record
Find it in source file: grep "<county_id>" data/<county>/<year>/*.csv
Compare source value vs parsed value
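For CSV sources, the lookup and comparison steps can be scripted. This is a minimal sketch: the helper names `find_source_rows` and `diff_fields` are hypothetical, and the column-to-field mapping must be supplied per county, since source column names are not defined by this skill.

```python
import csv
from pathlib import Path


def find_source_rows(source_dir, county_id):
    """Yield (file name, row dict) for CSV rows where any cell equals county_id."""
    for path in sorted(Path(source_dir).glob("*.csv")):
        with open(path, newline="", encoding="utf-8", errors="replace") as f:
            for row in csv.DictReader(f):
                if any(isinstance(v, str) and v.strip() == county_id
                       for v in row.values()):
                    yield path.name, row


def diff_fields(source_row, parsed, mapping):
    """Return {parsed_field: (source_value, parsed_value)} for mismatches.

    mapping is {source column: dotted parsed path}; its contents are hypothetical.
    """
    diffs = {}
    for column, dotted in mapping.items():
        value = parsed
        for key in dotted.split("."):
            value = value.get(key) if isinstance(value, dict) else None
        source_value = (source_row.get(column) or "").strip()
        # Compare as stripped strings; a missing parsed value counts as empty.
        if source_value != str(value if value is not None else "").strip():
            diffs[dotted] = (source_value, value)
    return diffs
```

Unlike grep, the lookup matches whole cells, so `R000001` does not also match `R0000010`. A non-empty diff where the source value looks correct suggests a parser bug rather than bad source data.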

Pattern 2: Is this legacy or current data?

# Get tax year breakdown - issues only in old years = legacy data problem
pipeline = [
    {"$match": {"county": "archer"}},
    {"$group": {"_id": "$tax_year", "count": {"$sum": 1}}},
    {"$sort": {"_id": -1}}
]
years = list(db.parcels.aggregate(pipeline))
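To see whether an issue is confined to old years, the breakdown can be compared with the same breakdown restricted to bad records (for example, by adding one of the filters from "Find Incomplete Records" to the `$match` stage). This is a sketch; the function name `issue_rate_by_year` is hypothetical.

```python
def issue_rate_by_year(total_rows, bad_rows):
    """Combine two lists of {"_id": tax_year, "count": n} into percent bad per year."""
    bad = {row["_id"]: row["count"] for row in bad_rows}
    rates = {}
    for row in total_rows:
        year, total = row["_id"], row["count"]
        # Years with no bad records do not appear in bad_rows, so default to 0.
        rates[year] = round(100 * bad.get(year, 0) / total, 1) if total else 0.0
    return rates
```

A high rate in older years alongside a near-zero rate in the current year points to a legacy data problem rather than a current parser bug.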

Pattern 3: What fields are populated?

records = list(db.parcels.find({"county": "archer"}).limit(10))
for r in records:
    prop = r.get("property_address") or {}
    mail = r.get("mailing_address") or {}
    print(f"{r['county_id']}: prop_city={prop.get('city')}, mail_city={mail.get('city')}")
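Printing ten records works for a quick look. For a larger sample, a fill rate per field is easier to read. This is a sketch; the function name `fill_rates` is hypothetical, and a field counts as populated when it is neither missing, `None`, nor an empty string.

```python
def fill_rates(records, dotted_fields):
    """Return {field: percent of records where the dotted field is non-empty}."""
    def lookup(record, dotted):
        # Walk nested dicts; any missing or non-dict step yields None.
        value = record
        for key in dotted.split("."):
            value = value.get(key) if isinstance(value, dict) else None
        return value

    if not records:
        return {field: 0.0 for field in dotted_fields}
    return {
        field: round(
            100 * sum(lookup(r, field) not in (None, "") for r in records) / len(records),
            1,
        )
        for field in dotted_fields
    }
```

For example, `fill_rates(records, ["property_address.city", "mailing_address.city"])` summarizes the same two fields that the loop prints.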

Health Report Template

After inspection, summarize findings:

## <County> <Year> Health Report

### SOURCE FILES
- Directory: data/<county>/<year>/
- Files: [list files present]
- Status: Available / Missing / Empty

### RECORD COUNTS
- Total: X records
- With owners: X% 
- With valuation: X%

### FIELD ANOMALIES
- [List any issues found]

### OVERALL ASSESSMENT
- Status: healthy / suspicious / broken
- Rationale: [brief explanation]
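The template can be filled in programmatically once the metrics are known. This is a sketch; the function name `render_health_report` and its parameters are illustrative, and `metrics` is assumed to be a dict with `total`, `pct_with_owner`, and `pct_with_valuation` keys.

```python
def render_health_report(county, year, files, source_status,
                         metrics, anomalies, status, rationale):
    """Fill in the health report template and return it as a string."""
    lines = [
        f"## {county.title()} {year} Health Report",
        "",
        "### SOURCE FILES",
        f"- Directory: data/{county}/{year}/",
        f"- Files: {', '.join(files) if files else 'none'}",
        f"- Status: {source_status}",
        "",
        "### RECORD COUNTS",
        f"- Total: {metrics['total']} records",
        f"- With owners: {metrics['pct_with_owner']}%",
        f"- With valuation: {metrics['pct_with_valuation']}%",
        "",
        "### FIELD ANOMALIES",
    ]
    # One bullet per anomaly, or an explicit "None found" line.
    lines += [f"- {item}" for item in anomalies] or ["- None found"]
    lines += [
        "",
        "### OVERALL ASSESSMENT",
        f"- Status: {status}",
        f"- Rationale: {rationale}",
    ]
    return "\n".join(lines)
```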

Related

data-quality skill - Uses inspection before fixing
fix-patterns.md - What to fix after inspection
