| name | inspect-county-data-health |
| description | Inspect data health for a given county and tax year by checking MongoDB records and data/<county>/<year> artifacts for completeness and obvious anomalies. |
| license | MIT |
| compatibility | opencode |
What I do
I provide a data health check for a given county and tax year.
For a specific county + tax_year, I:
- Analyze MongoDB records for that slice:
  - Does data exist?
  - Rough row counts (and how they compare to expectations if available)
  - Obvious field-level anomalies (e.g. missing owners, empty mailing addresses, zero or null critical fields)
- Inspect filesystem artifacts under `data/<county>/<year>/...` (the current layout)
  - Presence of any logs or auxiliary files used by the pipeline
- Cross-check run metadata in Mongo (e.g. `processing_runs`) for that county/year
- Produce a short health report:
  - Is this county/year probably OK, suspicious, or clearly broken?
  - What kinds of issues show up (coverage, schema, or value-level)?
This skill is aligned with the current design where the pipeline syncs directly to MongoDB and writes per-county/year artifacts under data/<county>/<year>/, not data_unified or data_raw.
When to use me
Use this skill when:
- You’ve run the pipeline for a county/year and want a quick sanity check
- You suspect there might be silent data issues (e.g. low coverage, malformed addresses)
- You’re deciding which counties to trust for downstream consumers (exports, APIs, analytics)
This is a read-only health check, not a fixer.
Inputs I expect
- `county` (required, string)
  - e.g. `"travis"`
- `tax_year` (required, int)
  - e.g. `2025`
- `expected_min_records` (optional, int)
  - If provided, I'll compare actual record counts to this threshold.
If `expected_min_records` is not provided, I'll still report counts, but I won't assert strict pass/fail on coverage.
Project assumptions
I assume:
- This skill is run from the project root.
- Core property data lives in MongoDB, with collections that can be filtered by county and tax_year.
- Per-county/year artifacts live under `data/<county>/<tax_year>/...`
- There is run metadata (e.g. `processing_runs`) that can be used to see whether this county/year has a completed or failed run.
If these assumptions do not hold (e.g. MongoDB is unreachable or the expected collections are missing), I’ll report that explicitly.
How I work (high level)
Normalize input
- Confirm `county` and `tax_year` are present.
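A minimal sketch of this step, assuming the skill receives a plain dict of inputs (the key names and return shape here are illustrative, not a fixed contract):

```python
def normalize_inputs(params: dict) -> tuple[str, int, int | None]:
    """Validate and normalize county / tax_year / expected_min_records."""
    county = str(params.get("county", "")).strip().lower()
    if not county:
        raise ValueError("county is required (e.g. 'travis')")

    tax_year = params.get("tax_year")
    if not isinstance(tax_year, int):
        raise ValueError("tax_year is required and must be an int (e.g. 2025)")

    expected_min = params.get("expected_min_records")  # optional threshold
    if expected_min is not None:
        expected_min = int(expected_min)

    return county, tax_year, expected_min
```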
Check MongoDB connectivity
- Use MongoDB tools to ensure the DB is reachable.
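For example, a connectivity probe along these lines (the URI default and timeout are assumptions; the real pipeline may read them from its own config):

```python
from pymongo import MongoClient
from pymongo.errors import PyMongoError

def check_mongo(uri: str = "mongodb://localhost:27017") -> MongoClient | None:
    """Return a client if MongoDB answers a ping within 5 seconds, else None."""
    client = MongoClient(uri, serverSelectionTimeoutMS=5000)
    try:
        client.admin.command("ping")  # cheap round-trip to confirm the server is up
        return client
    except PyMongoError as exc:
        print(f"MongoDB unreachable: {exc}")
        return None
```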
Gather run metadata
- Look up `processing_runs` (or equivalent) entries for this county/year:
  - Has a run completed?
  - Was it marked success/failed/partial?
  - Are there any stored error messages?
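A sketch of that lookup, assuming a `processing_runs` collection keyed by `county` and `tax_year` with `status`, `started_at`, and `error` fields (these field names are assumptions to adapt to the real schema):

```python
def latest_run(db, county: str, tax_year: int) -> dict | None:
    """Return the most recent processing_runs entry for this county/year, if any."""
    return db["processing_runs"].find_one(
        {"county": county, "tax_year": tax_year},
        sort=[("started_at", -1)],  # newest run first
    )

# Interpreting the result:
#   run is None            -> no run recorded for this slice
#   run["status"] failed   -> surface run.get("error") in the report
```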
Inspect main data collections
- Query the main property data collection(s) for records matching this county/year.
- Compute simple metrics, for example:
  - Total record count
  - % of records with missing owner names
  - % of records with missing or malformed mailing addresses
  - % of records with obviously invalid numeric fields (e.g. non-positive improvement values when they should be > 0, if that pattern exists)
- If `expected_min_records` is provided:
  - Compare the actual count against it and flag if significantly lower.
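A sketch of those metrics, assuming a `properties` collection with `owner_name`, `mailing_address`, and `improvement_value` fields (the collection and field names are assumptions; the real schema may differ):

```python
def coverage_metrics(db, county: str, tax_year: int) -> dict:
    """Record count plus rough anomaly rates for one county/year slice."""
    coll = db["properties"]  # assumed collection name
    base = {"county": county, "tax_year": tax_year}

    total = coll.count_documents(base)
    if total == 0:
        return {"total": 0}

    def pct(extra: dict) -> float:
        return 100.0 * coll.count_documents({**base, **extra}) / total

    return {
        "total": total,
        # {"$in": [None, ""]} also matches documents where the field is absent
        "pct_missing_owner": pct({"owner_name": {"$in": [None, ""]}}),
        "pct_missing_mailing": pct({"mailing_address": {"$in": [None, ""]}}),
        # only meaningful if improvement_value is expected to be > 0
        "pct_nonpositive_improvement": pct({"improvement_value": {"$lte": 0}}),
    }
```

If `expected_min_records` is provided, the coverage flag is simply whether `metrics["total"]` falls below that threshold.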
Inspect data/<county>/<tax_year>/ artifacts
- Check whether `data/<county>/<tax_year>/` exists.
- Note presence of any logs or sidecar files that might indicate warnings or partial processing.
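A sketch of the filesystem check with pathlib (the log/sidecar filename heuristics are illustrative guesses, not a guaranteed layout):

```python
from pathlib import Path

def inspect_artifacts(county: str, tax_year: int, root: Path = Path("data")) -> dict:
    """Report whether data/<county>/<tax_year>/ exists and what it contains."""
    year_dir = root / county / str(tax_year)
    if not year_dir.is_dir():
        return {"exists": False}

    files = sorted(p.name for p in year_dir.iterdir() if p.is_file())
    return {
        "exists": True,
        "file_count": len(files),
        # Surface anything that looks like a log or error sidecar worth reading.
        "log_like_files": [f for f in files if "log" in f.lower() or "error" in f.lower()],
    }
```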
Summarize health
I’ll produce a concise report with sections like:
- RUN STATUS
  - Based on `processing_runs` (if available).
- COVERAGE
  - Record counts, and whether they meet `expected_min_records` (if provided).
- FIELD ANOMALIES
  - High-level stats on missing or malformed key fields.
- FILESYSTEM ARTIFACTS
  - Basic info on `data/<county>/<tax_year>/` presence.
- OVERALL ASSESSMENT
  - `healthy`, `suspicious`, or `broken`, with a short rationale.
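Purely as an illustration of shape (placeholders in angle brackets, no real values; the exact wording is not fixed), the report might look like:

```text
RUN STATUS:           <latest processing_runs status, or "no run metadata found">
COVERAGE:             <record count> records (expected_min_records: <threshold or "not set">)
FIELD ANOMALIES:      <pct>% missing owner, <pct>% missing mailing address, ...
FILESYSTEM ARTIFACTS: data/<county>/<tax_year>/ <present | missing>, <n> files
OVERALL ASSESSMENT:   <healthy | suspicious | broken> (short rationale)
```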
Safety and scope rules
- I am strictly read-only; I do not modify MongoDB or any files.
- I do not attempt any repairs; I only surface issues and possible causes.
- If schema details differ from expectations, I’ll adjust the analysis to what is actually present instead of forcing a specific shape.
Example usage
"Use
inspect-county-data-healthfor countytravisin 2025, expecting at least 200k records."
I will:
- Check MongoDB for travis/2025 processing runs
- Count travis/2025 records in the main property data collection(s)
- Compute simple anomaly metrics on key fields
- Look for `data/travis/2025/` on disk
- Return a short health report, including whether the ~200k expectation was met.