| name | inspect-raw-data |
| description | Inspect raw data for debugging - MongoDB records and source files. Use before fixing issues to understand root cause. Can inspect parsed records, raw source files (CSV, TXT, PDF, JSON), compare them, and produce health assessments. |
Overview
Two inspection targets:
- MongoDB - Parsed records in the county_data.parcels collection
- Source Files - Raw data in the data/<county>/<year>/ directory
Workflow
1. Check Source Data Availability
Before diagnosing parsing issues, verify source files exist:
ls -la data/<county>/<year>/
| Result | Diagnosis | Action |
|---|---|---|
| Directory doesn't exist | Data not downloaded | Download data first |
| Directory empty | Download failed/incomplete | Re-download |
| Has CSV/TXT/PDF files | Source available | Continue to parsing check |
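The table above can be sketched as a small helper for scripted checks. This is a minimal sketch, not part of the existing tooling: the function name `check_source_dir` and the returned labels are assumptions, and the extension list is taken from the directory layout described in this document.

```python
from pathlib import Path

# Extensions treated as source data (assumption: mirrors the directory layout in this doc).
SOURCE_SUFFIXES = {".csv", ".txt", ".pdf", ".json"}


def check_source_dir(path):
    """Classify a data/<county>/<year>/ directory as 'missing', 'empty', or 'available'."""
    directory = Path(path)
    if not directory.is_dir():
        return "missing"  # data not downloaded
    source_files = [
        p for p in directory.iterdir()
        if p.is_file() and p.suffix.lower() in SOURCE_SUFFIXES
    ]
    # 'empty' here means "no recognised source files", so a directory that
    # only holds a snapshots/ subfolder is also reported as empty.
    return "available" if source_files else "empty"
```

A call such as `check_source_dir("data/archer/2024")` maps directly onto the three rows of the table.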
2. Check MongoDB Records
Query for county/year data and compute metrics:
- Total record count
- % with missing owners
- % with missing/malformed addresses
- % with zero valuations
3. Compare Source vs Parsed
If issues found, compare source file values to parsed MongoDB values to identify parser bugs.
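One way to make this comparison repeatable is a field-by-field diff between a source row and the parsed record. The sketch below is illustrative only: `diff_fields` is a hypothetical helper, and the column names in the usage example (`PROP_ID`, `MAIL_ZIP`) and the sample values are made up, since real source columns vary by county.

```python
def diff_fields(source_row, parsed_record, field_map):
    """Return {parsed_path: (source_value, parsed_value)} for every mismatch.

    field_map maps a source column name to a dotted path in the parsed record.
    Values are compared as stripped strings, so 76351 and "76351 " count as equal.
    """
    mismatches = {}
    for source_col, parsed_path in field_map.items():
        source_value = str(source_row.get(source_col) or "").strip()
        # Walk the dotted path into the nested parsed record.
        node = parsed_record
        for key in parsed_path.split("."):
            node = node.get(key) if isinstance(node, dict) else None
        parsed_value = "" if node is None else str(node).strip()
        if source_value != parsed_value:
            mismatches[parsed_path] = (source_value, parsed_value)
    return mismatches


# Usage with made-up values: the parsed zip code lost its last digit.
source_row = {"PROP_ID": "R000001", "MAIL_ZIP": "76351"}
parsed = {"county_id": "R000001", "mailing_address": {"zip_code": "7635"}}
field_map = {"PROP_ID": "county_id", "MAIL_ZIP": "mailing_address.zip_code"}
print(diff_fields(source_row, parsed, field_map))
# {'mailing_address.zip_code': ('76351', '7635')}
```

A mismatch where the source value is correct points at the parser; a mismatch-free record with a bad value points at the source data itself.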
4. Summarize Health
Produce assessment: healthy, suspicious, or broken.
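The three labels can be derived from the metrics in step 2. The sketch below assumes cut-offs that this document does not define: the 95% and 50% thresholds and the name `assess_health` are placeholders to be tuned per county, not established rules.

```python
def assess_health(total, pct_with_owner, pct_with_valuation,
                  healthy_min=95.0, broken_max=50.0):
    """Map coverage percentages to 'healthy', 'suspicious', or 'broken'.

    healthy_min and broken_max are hypothetical thresholds, not project policy.
    """
    if total == 0:
        return "broken"  # nothing parsed at all
    worst = min(pct_with_owner, pct_with_valuation)
    if worst >= healthy_min:
        return "healthy"
    if worst < broken_max:
        return "broken"
    return "suspicious"
```

Using the worst metric rather than an average keeps one badly parsed field from being hidden by well-populated ones.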
Source File Inspection
Directory Structure
data/<county>/<year>/
├── *.csv # CSV exports
├── *.TXT # Fixed-width or tab-delimited
├── *.pdf # PDF appraisal rolls
├── *.json # JSON exports
└── snapshots/ # Web scrape snapshots (if applicable)
Finding Source Files
# List all files for a county/year
ls -la data/<county>/<year>/
# Find all source files
find data/<county>/<year>/ -type f \( -name "*.csv" -o -name "*.TXT" -o -name "*.json" -o -name "*.pdf" \)
# Check file size
du -h data/<county>/<year>/*
Inspecting Source Files
# CSV - view headers and first rows
head -5 data/<county>/<year>/*.csv
# TXT - view structure (often fixed-width)
head -20 data/<county>/<year>/*.TXT
# Count lines
wc -l data/<county>/<year>/*
# Find specific record in source
grep "R000001" data/<county>/<year>/*.csv
Parser Type Detection
# Look for county-specific parser
ls county_parser/parsers/*<county>*.py
# Check if county has a spec file (appraisal parser)
ls county_parser/parsers/appraisal_info_parser/<county>*.json
# Check if in PDF registry
grep "<county>" county_parser/parsers/pdf_parser_registry.py
| Parser Type | How to identify | Target File |
|---|---|---|
| explicit | Has <county>_county_parser.py | county_parser/parsers/<county>_county_parser.py |
| appraisal | Has spec in appraisal_info_parser/ | county_parser/parsers/appraisal_info_parser/base.py |
| pdf | Listed in PDF_ONLY_COUNTIES | county_parser/parsers/pdf_parser_base.py |
| csv | Has .csv source files | county_parser/parsers/csv_parser_base.py |
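The identification rules can also be checked from Python. This is a rough sketch under assumptions: `detect_parser_type` is a hypothetical helper, the order in which the rules are tried (explicit, appraisal, pdf, csv) is a guess rather than documented behaviour, and the PDF check is the same plain substring match as the `grep` command above, so it can give false positives.

```python
from pathlib import Path


def detect_parser_type(repo_root, county, data_dir=None):
    """Guess the parser type for a county from the repository layout.

    Returns 'explicit', 'appraisal', 'pdf', 'csv', or None if nothing matches.
    """
    parsers = Path(repo_root) / "county_parser" / "parsers"

    # County-specific parser module.
    if (parsers / f"{county}_county_parser.py").is_file():
        return "explicit"

    # Spec file for the appraisal parser.
    spec_dir = parsers / "appraisal_info_parser"
    if spec_dir.is_dir() and any(spec_dir.glob(f"{county}*.json")):
        return "appraisal"

    # County name mentioned anywhere in the PDF registry (substring match, like grep).
    registry = parsers / "pdf_parser_registry.py"
    if registry.is_file() and county in registry.read_text():
        return "pdf"

    # Fall back to CSV if the data directory holds .csv files.
    if data_dir is not None and Path(data_dir).is_dir() and any(Path(data_dir).glob("*.csv")):
        return "csv"

    return None
```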
MongoDB Inspection
Sample Records
from pymongo import MongoClient
client = MongoClient()
db = client["county_data"]
# Sample random records
records = list(db.parcels.aggregate([
{"$match": {"county": "archer"}},
{"$sample": {"size": 5}}
]))
# Get a specific record by county_id
record = db.parcels.find_one({"county": "archer", "county_id": "R000001"})
# View specific fields
records = list(db.parcels.find(
{"county": "archer"},
{"county_id": 1, "property_address": 1, "mailing_address": 1, "tax_year": 1}
).limit(10))
Tax Year Breakdown
pipeline = [
{"$match": {"county": "archer"}},
{"$group": {"_id": "$tax_year", "count": {"$sum": 1}}},
{"$sort": {"_id": -1}}
]
list(db.parcels.aggregate(pipeline))
Find Bad Records
import re
ZIP_PATTERN = re.compile(r"^\d{5}(-\d{4})?$")
# Find records with malformed zip codes
records = list(db.parcels.find(
{"county": "archer", "mailing_address.zip_code": {"$exists": True}},
{"county_id": 1, "mailing_address.zip_code": 1}
).limit(100))
for r in records:
    zip_code = r.get("mailing_address", {}).get("zip_code")
    if zip_code and not ZIP_PATTERN.match(str(zip_code)):
        print(f"{r['county_id']}: {zip_code}")
Find Incomplete Records
# Property city missing but mailing city exists
records = list(db.parcels.find({
"county": "archer",
"mailing_address.city": {"$nin": [None, ""]},
"$or": [
{"property_address": None},
{"property_address.city": None},
{"property_address.city": ""}
]
}, {"county_id": 1, "property_address": 1, "mailing_address": 1}).limit(10))
Data Quality Metrics
# Count owners and valuations
pipeline = [
    {"$match": {"county": "<county>", "tax_year": 2024}},
    {"$facet": {
        "total": [{"$count": "count"}],
        "with_owner": [
            {"$addFields": {"first_owner": {"$arrayElemAt": ["$owners", 0]}}},
            {"$match": {"first_owner.name": {"$exists": True, "$nin": [None, ""]}}},
            {"$count": "count"}
        ],
        "with_valuation": [
            {"$match": {"$or": [
                {"valuation.market_value": {"$gt": 0}},
                {"valuation.assessed_value": {"$gt": 0}}
            ]}},
            {"$count": "count"}
        ]
    }}
]
# $facet emits exactly one document, with one array per facet
metrics = list(db.parcels.aggregate(pipeline))[0]
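The single document produced by this `$facet` pipeline holds one array per facet, and a facet whose `$count` stage saw no documents comes back as an empty array rather than `[{"count": 0}]`. A small helper, sketched here under the hypothetical name `summarize_facet`, can turn that document into the percentages used in the health report while handling the empty case.

```python
def summarize_facet(facet_doc):
    """Convert the $facet result document into a total and coverage percentages."""

    def count_of(name):
        # $count emits no document for empty input, so the facet array may be [].
        bucket = facet_doc.get(name) or []
        return bucket[0]["count"] if bucket else 0

    total = count_of("total")

    def pct(name):
        return round(100.0 * count_of(name) / total, 1) if total else 0.0

    return {
        "total": total,
        "pct_with_owner": pct("with_owner"),
        "pct_with_valuation": pct("with_valuation"),
    }
```

The percentages of missing owners and zero valuations listed in the workflow are then `100 - pct_with_owner` and `100 - pct_with_valuation`.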
Useful Field Projections
| Scenario | Projection |
|---|---|
| Address issues | {"county_id": 1, "property_address": 1, "mailing_address": 1} |
| Owner issues | {"county_id": 1, "owners": 1, "january_1_owner": 1} |
| Value issues | {"county_id": 1, "valuation": 1} |
| Full record | {} (no projection) |
Common Inspection Patterns
Pattern 1: Why is this field wrong?
- Sample records with issues from MongoDB
- Note the county_id of a bad record
- Find it in source file: grep "<county_id>" data/<county>/<year>/*.csv
- Compare source value vs parsed value
Pattern 2: Is this legacy or current data?
# Get tax year breakdown - issues only in old years = legacy data problem
pipeline = [
{"$match": {"county": "archer"}},
{"$group": {"_id": "$tax_year", "count": {"$sum": 1}}},
{"$sort": {"_id": -1}}
]
years = list(db.parcels.aggregate(pipeline))
Pattern 3: What fields are populated?
records = list(db.parcels.find({"county": "archer"}).limit(10))
for r in records:
    prop = r.get("property_address") or {}
    mail = r.get("mailing_address") or {}
    print(f"{r['county_id']}: prop_city={prop.get('city')}, mail_city={mail.get('city')}")
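Printing ten records shows individual values; for a larger sample it may help to compute a fill rate per field. The sketch below assumes a hypothetical helper `population_rates` that works on any list of record dicts, such as the result of a `find(...).limit(n)` call, and treats `None`, empty strings, and empty containers as unpopulated.

```python
from collections import Counter


def population_rates(records, paths):
    """Return {dotted_path: fraction of records where that field is populated}."""
    filled = Counter()
    for record in records:
        for path in paths:
            # Walk the dotted path; any missing or non-dict step yields None.
            node = record
            for key in path.split("."):
                node = node.get(key) if isinstance(node, dict) else None
            if node not in (None, "", [], {}):
                filled[path] += 1
    total = len(records)
    return {path: (filled[path] / total if total else 0.0) for path in paths}
```

For example, `population_rates(records, ["property_address.city", "mailing_address.city"])` shows at a glance whether one address block is systematically emptier than the other.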
Health Report Template
After inspection, summarize findings:
## <County> <Year> Health Report
### SOURCE FILES
- Directory: data/<county>/<year>/
- Files: [list files present]
- Status: Available / Missing / Empty
### RECORD COUNTS
- Total: X records
- With owners: X%
- With valuation: X%
### FIELD ANOMALIES
- [List any issues found]
### OVERALL ASSESSMENT
- Status: healthy / suspicious / broken
- Rationale: [brief explanation]
Related
- data-quality skill - Uses inspection before fixing
- fix-patterns.md - What to fix after inspection