| name | data-quality-validation |
| description | Systematic data validation, error detection, cross-source reconciliation, and query correctness checking for analytical work. Use when validating Snowflake queries, catching calculation errors, reconciling metrics across different data sources, checking for null values, ensuring date range validity, detecting statistical anomalies, validating metric calculations (median vs mean, rate normalization), checking aggregation grain (per-record vs per-entity), validating contribution analysis for non-additive metrics, or validating consistency across analysis sections. Essential when reviewing analysis before publication, debugging unexpected results, or ensuring data quality in reports. Triggers include "validate this query", "check for errors", "why don't these numbers match", "should I use median or mean", "why don't contributions sum to 100%", "reconcile these metrics", "verify data quality", or any request to catch potential issues in data or calculations. |
Data Quality & Validation
Systematic framework for catching data quality issues, query errors, metric calculation problems, and inconsistencies before they affect analysis results.
When to Use This Skill
Proactive Validation (Before Analysis):
- Reviewing Snowflake queries for correctness
- Validating date ranges for analysis periods
- Checking for null values in critical fields
- Ensuring balanced time periods for experiments
- Verifying data availability before starting work
- Validating metric choice (median vs mean)
- Checking aggregation grain (per-record vs per-entity)
Reactive Validation (Investigating Issues):
- Numbers don't match across different sections
- Unexpected results that seem wrong
- Metrics from different tables don't reconcile
- Analysis shows contradictory patterns
- Stakeholder questions results accuracy
- Averages seem inflated or deflated
- Contributions don't sum to 100%
Pre-Publication Checks:
- Final review before sharing analysis
- Cross-checking totals against detail sums
- Validating all percentages sum to ~100%
- Ensuring consistent methodology throughout
- Confirming appropriate metric selection
Validation Philosophy: Inform, Don't Block
All validations in this skill are INFORMATIONAL - nothing blocks your analysis.
What This Means
✅ Analysis always proceeds
- Validations note potential issues
- You decide whether to fix or continue
- Work isn't stopped waiting for "perfect" data
ℹ️ Three types of responses:
1. Informational notes - Data quality observations
   ℹ️ Data Quality Note: Field has 8% null values. Analysis proceeds with 92% of data.
2. Warnings with recommendations - Suggests improvements
   ⚠️ Period Balance Issue: POST has 23 days (not a multiple of 7). Recommendation: Use 21 or 28 days to avoid weekday bias. Analysis proceeds with current period.
3. Questions for clarification - Confirms your intent
   ❓ Vertical Filter Check: You mentioned "for sale leads" but the query has no vertical filter. Should this include all verticals?
🎯 You always control what happens next:
- Accept the current approach
- Apply recommended fix
- Investigate flagged issues
- Proceed with noted limitations
Why This Approach
Faster iteration
- Don't wait for perfect data
- Proceed with 90%+ completeness
- Refine in subsequent analyses
You know your context
- Might have good reasons for approach
- Can prioritize what matters most
- Decide acceptable trade-offs
Transparency
- Always aware of data quality
- Can document caveats in results
- Prepared for stakeholder questions
Learning opportunity
- Understand best practices
- Build better queries over time
- Recognize patterns to watch for
What Gets Validated (All Informational)
| Category | What It Checks | User Decides |
|---|---|---|
| Query Correctness | Missing GROUP BY, division errors, date ranges | Fix now or run as-is |
| Data Consistency | Totals match details, percentages sum, contributions | Investigate or proceed |
| Cross-Source | Metrics align across tables | Which source to trust |
| Null Values | Missing data in fields | Acceptable % or not |
| Date Balance | X*7 day periods for experiments | Adjust period or accept bias |
| Anomalies | Statistical outliers | Investigate or include |
| Vertical Filters | for_sale/for_rent/seller present | Which vertical(s) to include |
| Metric Calculation | Median vs mean, rate normalization | Which metric to use |
| Aggregation Grain | Per-record vs per-entity | Which aggregation level |
| Contribution Analysis | Non-additive metrics (median) | Accept gap or use mean |
Response Pattern
Every validation follows this structure:
[Icon] [Validation Type]:
- What was detected
- Why it might matter
- Recommendation (if applicable)
Analysis proceeds [with current approach / with noted limitation].
You'll see:
- ℹ️ Informational (for awareness)
- ⚠️ Warning (recommended to address, but optional)
- ❓ Question (needs clarification)
- ✓ Validated (all good)
You'll never see:
- ❌ Error - Cannot proceed
- 🛑 Analysis blocked
- ⚡ Must fix before continuing
Bottom line: This skill helps you make informed decisions, not enforce rigid rules. You stay in control of your analysis.
Core Workflow
Standard Validation Process
Step 1: Query Correctness
Check before running:
→ Are date ranges correct and properly bounded?
→ Do aggregations have proper GROUP BY clauses?
→ Are calculations using correct formulas?
→ Do divisions use NULLIF to prevent errors?
→ Are there any hardcoded values that should be parameters?
Step 2: Data Quality Checks
After retrieving data:
→ Check for NULL values in critical fields (informational)
→ Verify date ranges match what was requested
→ Look for unexpected patterns or anomalies
→ Validate record counts are reasonable
→ Check for duplicate records if unexpected
Step 3: Metric Calculation Validation
Before calculating metrics:
→ Is data skewed? (Use median instead of mean)
→ Comparing rates at same time scale? (Normalize first)
→ Using correct baseline for percentages?
→ Need weighted average instead of simple average?
Step 4: Aggregation Grain Check
When aggregating:
→ Are there multiple records per entity?
→ Should this be per-record or per-entity calculation?
→ Do duplicates affect the metric?
Step 5: Cross-Section Validation
Within the analysis:
→ Do summary totals equal sum of details?
→ Do percentages sum to ~100% (tolerance for rounding)?
→ For non-additive metrics (median), is gap expected?
→ Are metrics consistent across sections?
→ Do related numbers make logical sense together?
Step 6: Cross-Source Reconciliation
When using multiple tables:
→ Document expected relationship between sources
→ Calculate and explain any differences
→ Validate join keys match properly
→ Identify which source is authoritative
Validation Categories
1. Query Correctness
Common Issues:
- Date ranges don't match analysis definition
- Unbalanced pre/post periods (different # of weeks)
- Missing or incorrect GROUP BY clauses
- Wrong aggregation level (per-record vs per-entity)
- Division without NULLIF protection
Quick Validation:
-- Add to queries as sanity check
SELECT
MIN(event_date) as earliest,
MAX(event_date) as latest,
COUNT(*) as total_records,
COUNT(DISTINCT entity_id) as unique_entities,
CURRENT_DATE() as today
FROM your_table;
See validation_framework.md for detailed patterns
2. Data Consistency
What to Check:
- Summary totals = sum of detail rows (within 1%)
- Percentages sum to ~100% (within 0.5% for rounding)
- Contribution analysis for non-additive metrics
- Metrics align across all sections
- No contradictory statements in findings
Pattern:
1. Extract key totals from each section
2. Calculate expected relationships
3. Flag deviations > 1% threshold (informational)
4. Investigate root cause (usually query mismatch)
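A minimal sketch of this pattern in Python (pandas), assuming a pre-computed summary total and a detail table with an illustrative value column; the thresholds mirror the 1% and 0.5% tolerances above:

```python
import pandas as pd

def check_totals(summary_total, detail, value_col="value", tolerance=0.01):
    """Informational check: does the summary total match the sum of detail rows?"""
    detail_sum = detail[value_col].sum()
    gap = abs(summary_total - detail_sum) / summary_total
    status = "ℹ️" if gap > tolerance else "✓"
    print(f"{status} Summary {summary_total:,.0f} vs detail sum {detail_sum:,.0f} "
          f"({gap:.1%} gap, threshold {tolerance:.0%})")

def check_percentages(pcts, tolerance=0.5):
    """Informational check: do reported percentages sum to ~100%?"""
    total = sum(pcts)
    status = "ℹ️" if abs(total - 100) > tolerance else "✓"
    print(f"{status} Percentages sum to {total:.1f}% (tolerance ±{tolerance}%)")

# Toy numbers for illustration
detail = pd.DataFrame({"segment": ["A", "B", "C"], "value": [4200, 3100, 2650]})
check_totals(10_000, detail)            # 0.5% gap -> within the 1% threshold
check_percentages([42.0, 31.0, 26.5])   # 99.5% -> within rounding tolerance
```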
Special Case: Contribution Analysis for Non-Additive Metrics
When contributions don't sum to 100%, check if metric is additive:
Non-Additive Metrics (Gap is EXPECTED):
- Median, Percentiles (P25, P75, P90)
- Mode, Min/Max
- Ratios (ROAS = Revenue/Spend)
- Geometric mean
Response for non-additive:
ℹ️ Contribution Analysis Note (MEDIAN decomposition):
- Sum of contributions: 3.1pp
- Actual total change: 4.76pp
- Gap: 1.7pp (36%)
This gap is EXPECTED and CORRECT for median analysis.
Why: Median is not an additive metric
- Cannot decompose as simple weighted sums
- Gap represents interaction effects between segments
- Contributions show directional impact (which segments pushed median up/down)
- Magnitudes are approximate, not exact attributions
Key insight: All segments contributed to decline, with Paid Search
having largest impact due to volume (65%) and magnitude (-2.1%).
Analysis proceeds - this is mathematically correct.
Additive Metrics (Should Sum):
- Mean (arithmetic average)
- Sum, Count
- Proportions (when all categories included)
Response for additive with gap:
⚠️ Contribution Analysis Issue:
- Sum of contributions: 3.2%
- Actual total change: 4.5%
- Gap: 1.3pp (29% of the total change; additive metrics should reconcile within ~5%)
For MEAN decomposition, contributions should sum to total.
Possible causes:
- Missing segments not included
- Calculation error in weights
- Different time periods for segments vs total
Recommendation: Verify all segments included and weights sum to 100%.
Analysis proceeds but results may be incomplete.
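The distinction can be demonstrated with a short, self-contained sketch (toy data; segment names and the decomposition formula are illustrative). Each segment's contribution is computed as w_post * m_post - w_pre * m_pre; for the mean this sums exactly to the overall change, while for the median it generally does not:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
pre = pd.DataFrame({
    "segment": ["paid_search"] * 600 + ["organic"] * 400,
    "value": np.concatenate([rng.normal(100, 20, 600), rng.normal(140, 25, 400)]),
})
post = pd.DataFrame({
    "segment": ["paid_search"] * 650 + ["organic"] * 350,
    "value": np.concatenate([rng.normal(92, 20, 650), rng.normal(138, 25, 350)]),
})

def decompose(pre, post, agg):
    """Per-segment contribution: w_post * agg(post) - w_pre * agg(pre).
    The overall mean is the volume-weighted sum of segment means, so these
    contributions sum exactly to the overall mean change. The overall median
    has no such identity, so the same decomposition leaves a gap."""
    contrib = {}
    for seg in pre["segment"].unique():
        w_pre, w_post = (pre["segment"] == seg).mean(), (post["segment"] == seg).mean()
        m_pre = agg(pre.loc[pre["segment"] == seg, "value"])
        m_post = agg(post.loc[post["segment"] == seg, "value"])
        contrib[seg] = w_post * m_post - w_pre * m_pre
    return contrib

for name, agg in [("mean", np.mean), ("median", np.median)]:
    contrib = decompose(pre, post, agg)
    actual = agg(post["value"]) - agg(pre["value"])
    print(f"{name:6s} sum of contributions = {sum(contrib.values()):7.2f} | "
          f"actual change = {actual:7.2f}")
# For the mean the two numbers match exactly; for the median they generally do not.
```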
See validation_framework.md for implementation
3. Cross-Source Reconciliation
When metrics exist in multiple tables:
Step 1: Identify all relevant tables
Step 2: Document expected relationship
(equal, subset, filtered version, etc.)
Step 3: Calculate conversion/attrition rates
Step 4: Flag unexpected drops (>20% usually indicates issue)
Step 5: Trace sample records through pipeline
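A lightweight Python sketch of steps 3-5, assuming both sources share a join key (the table labels, key column, and 20% threshold are illustrative placeholders):

```python
import pandas as pd

def reconcile(source_a, source_b, key="lead_id", label_a="raw_leads", label_b="scored_leads"):
    """Informational reconciliation between two sources sharing a join key."""
    ids_a, ids_b = set(source_a[key]), set(source_b[key])
    overlap = len(ids_a & ids_b)
    drop_rate = 1 - overlap / len(ids_a) if ids_a else 0
    print(f"{label_a}: {len(ids_a):,} | {label_b}: {len(ids_b):,} | matched: {overlap:,}")
    if drop_rate > 0.20:
        print(f"⚠️ {drop_rate:.0%} of {label_a} records have no match in {label_b} "
              f"(>20% usually indicates a join or filter issue)")
        # Trace a few unmatched records through the pipeline
        sample = list(ids_a - ids_b)[:5]
        print(f"Sample unmatched {key}s to trace: {sample}")
    else:
        print(f"✓ Attrition {drop_rate:.0%} is within the expected range")
```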
See common_pitfalls.md for real example
4. Null Value Detection
Informational - Not Blocking:
Check for null values in critical fields and inform the user so they're aware, but allow analysis to proceed.
Quick Check:
-- Template query for null checking
SELECT
COUNT(*) as total,
SUM(CASE WHEN field IS NULL THEN 1 ELSE 0 END) as null_count,
ROUND(100.0 * null_count / NULLIF(total, 0), 1) as null_pct
FROM your_table;
What to Report:
- Primary keys: >0% null (note as data quality concern)
- Foreign keys: >5% null (note potential join issues)
- Price/revenue fields: >10% null (note impact on averages/medians)
- Sudden spike in null % (note potential pipeline problem)
Response Pattern:
ℹ️ Data Quality Note: Field X has 12% null values (1,234 of 10,000 records).
This may affect averages/medians. Analysis will proceed with available data.
See validation_framework.md for full framework
5. Date Range & Period Validation
For Experiments/A/B Tests:
Critical rule: When day-of-week patterns exist, periods MUST be X*7 days
# Quick check
def validate_period_balance(df, date_col, period_col):
    for period in df[period_col].unique():
        # Count distinct calendar days in the period, not raw rows
        n_days = df.loc[df[period_col] == period, date_col].nunique()
        if n_days % 7 != 0:
            print(f"⚠️ {period}: {n_days} days ({n_days/7:.1f} weeks) - UNBALANCED")
            print(f"   Fix: Use {n_days//7 * 7} or {(n_days//7 + 1) * 7} days")
        else:
            print(f"✓ {period}: {n_days} days ({n_days//7} weeks) - BALANCED")
See common_pitfalls.md for real impact
6. Anomaly Detection
Statistical Outlier Detection:
import numpy as np
from scipy import stats

# data = numeric Series/array of the metric being checked
# Flag values more than 3 standard deviations from the mean (~0.3% of a normal distribution)
z_scores = np.abs(stats.zscore(data))
outliers = data[z_scores > 3]
What to Flag (Informational):
- Spend anomalies: >3σ from 30-day mean
- Volume spikes: Day-over-day change >50%
- Performance anomalies: ROAS doubles suddenly
- Zero values: Sustained zeros in active metrics
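In addition to the z-score check above, a day-over-day spike check can be sketched like this (column names and the 50% threshold are illustrative):

```python
import pandas as pd

def flag_volume_spikes(daily, value_col="leads", date_col="event_date", threshold=0.5):
    """Informational: flag day-over-day changes larger than the threshold (50% by default)."""
    daily = daily.sort_values(date_col).copy()
    daily["dod_change"] = daily[value_col].pct_change()
    spikes = daily[daily["dod_change"].abs() > threshold]
    for _, row in spikes.iterrows():
        print(f"ℹ️ {row[date_col]}: {value_col} changed {row['dod_change']:+.0%} day-over-day")
    return spikes
```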
See validation_framework.md for implementation
7. Vertical Filter Validation
Critical for Real Estate Data:
When user mentions specific lead types, always verify the vertical filter is present.
Lead Type Mentions that Trigger Validation:
- "for sale" or "buy leads" → Check for
vertical = 'for_sale'orsubmitted_lead_vertical = 'for_sale' - "for rent" or "rental leads" → Check for
vertical = 'for_rent' - "seller leads" → Check for
vertical = 'seller'or lead category filters
Validation Pattern:
-- CORRECT: Query with vertical filter
SELECT COUNT(*) as leads
FROM leads_table
WHERE event_date >= '2025-10-01'
AND vertical = 'for_sale'; -- ✓ Vertical filter present
-- INCORRECT: Missing vertical filter
SELECT COUNT(*) as leads
FROM leads_table
WHERE event_date >= '2025-10-01'; -- ❌ No vertical filter!
What to Do:
- Check the query for vertical/lead type filters
- If missing, ask user: "I notice there's no vertical filter - should this be limited to for_sale leads, or include all verticals?"
- If present, verify it matches what user mentioned
- Common column names: vertical, submitted_lead_vertical, lead_vertical, LEAD_VERTICAL
Why This Matters:
- Mixing verticals can skew metrics significantly
- for_sale vs for_rent have different performance characteristics
- Missing filter = including data user didn't intend
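A rough heuristic for this check can be sketched in Python; the keyword map, column list, and regex matching are simplifications (not a SQL parser), and the example query is hypothetical:

```python
import re

LEAD_TYPE_TO_VERTICAL = {          # keyword in the request -> expected filter value
    "for sale": "for_sale", "buy lead": "for_sale",
    "for rent": "for_rent", "rental": "for_rent",
    "seller lead": "seller",
}
VERTICAL_COLUMNS = r"(vertical|submitted_lead_vertical|lead_vertical)"

def check_vertical_filter(request_text, sql):
    """Informational heuristic: if the request names a lead type, look for a
    matching vertical filter in the SQL."""
    for keyword, vertical in LEAD_TYPE_TO_VERTICAL.items():
        if keyword in request_text.lower():
            pattern = VERTICAL_COLUMNS + r"\s*=\s*'" + vertical + "'"
            if re.search(pattern, sql, flags=re.IGNORECASE):
                print(f"✓ Vertical filter for '{vertical}' present")
            else:
                print(f"❓ Request mentions '{keyword}' but no {vertical} filter found - "
                      f"limit to {vertical} or include all verticals?")

check_vertical_filter(
    "How many for sale leads did we get in October?",
    "SELECT COUNT(*) FROM leads_table WHERE event_date >= '2025-10-01'",
)
```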
See validation_framework.md for examples
8. Metric Calculation Validation
Ensures metrics are calculated appropriately for the data distribution:
Auto-detect and recommend:
- Median vs Mean based on skewness
- Rate normalization for time comparisons
- Weighted vs simple averages
- Percentage calculation consistency
Common Issues:
# Issue 1: Using mean for skewed data
Data: [$50K, $80K, $90K, $2M, $5M]
Mean: $1.4M (dominated by outliers)
Median: $90K (representative value)
→ Use MEDIAN for price data
# Issue 2: Comparing unnormalized rates
PRE: 102,085 leads / 28 days
POST: 85,953 leads / 21 days
→ Must normalize to same time unit (daily/weekly)
# Issue 3: Wrong percentage base
Calculating "% change" but using different denominators
→ Always use consistent baseline
Validation Pattern:
1. Check data distribution (skewness)
2. If |skewness| > 1.0 and using mean → Recommend median
3. For rate comparisons → Verify same time scale
4. For percentages → Confirm consistent denominator
Response is informational:
ℹ️ Metric Selection Note:
- Data is highly skewed (skewness: 2.4)
- Mean: $450,000
- Median: $285,000
- Difference: 58% higher mean due to outliers
Recommendation: Consider using MEDIAN for more representative value.
Analysis proceeds with current metric.
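A minimal sketch of the skewness-based recommendation using scipy; the 1.0 threshold mirrors the pattern above and the sample values are illustrative:

```python
import numpy as np
from scipy import stats

def recommend_center(values, skew_threshold=1.0):
    """Informational: recommend median over mean when the distribution is highly skewed."""
    values = np.asarray(values, dtype=float)
    skew = stats.skew(values)
    mean, median = values.mean(), np.median(values)
    if abs(skew) > skew_threshold and not np.isclose(mean, median):
        diff = (mean - median) / median
        print(f"ℹ️ Skewness {skew:.1f}: mean ${mean:,.0f} is {diff:+.0%} vs median "
              f"${median:,.0f} - consider MEDIAN for a more representative value")
    else:
        print(f"✓ Skewness {skew:.1f}: mean and median are both reasonable")

# Example: home prices with a few large outliers
recommend_center([50_000, 80_000, 90_000, 2_000_000, 5_000_000])
```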
See validation_framework.md for implementation
9. Aggregation Grain Validation
Ensures metrics aggregate at the correct entity level:
What to check:
- Are there duplicate records inflating counts?
- Is calculation per-record or per-entity?
- Do multiple records per entity affect metrics?
Example Issue:
MCD table: 143K records, 62K unique inquiries
Per-record avg EFR: $12.48 (WRONG - includes duplicates)
Per-inquiry avg EFR: $25+ (CORRECT - one per entity)
Problem: Multiple records per inquiry with $0 EFR dilute average
Validation Pattern:
1. Count total records vs unique entities
2. If records > entities × 1.2 (>20% duplication):
- Calculate metric both ways
- Compare per-record vs per-entity
- If >10% difference → Note grain issue
3. Recommend explicit GROUP BY entity_id
Response is informational:
ℹ️ Aggregation Grain Note:
- 143,245 records but only 62,866 unique inquiries (2.3x duplication)
- Per-record average: $12.48
- Per-inquiry average: $25.14
- Difference: 101% (doubled!)
Recommendation: Consider grouping to inquiry level first:
SELECT inquiry_id, MAX(efr) as efr
FROM table GROUP BY inquiry_id
Analysis proceeds - user can decide if regrouping needed.
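A sketch of this check in pandas; the entity and value column names are illustrative, and the per-entity aggregation (MAX here, mirroring the recommendation above) depends on the metric:

```python
import pandas as pd

def check_grain(df, entity_col="inquiry_id", value_col="efr"):
    """Informational: compare per-record vs per-entity averages when entities repeat."""
    n_records, n_entities = len(df), df[entity_col].nunique()
    if n_records > n_entities * 1.2:                      # >20% duplication
        per_record = df[value_col].mean()
        per_entity = df.groupby(entity_col)[value_col].max().mean()
        diff = abs(per_entity - per_record) / per_record
        print(f"ℹ️ {n_records:,} records vs {n_entities:,} entities "
              f"({n_records / n_entities:.1f}x duplication)")
        print(f"   per-record avg: {per_record:,.2f} | per-entity avg: {per_entity:,.2f} "
              f"({diff:.0%} difference)")
    else:
        print(f"✓ {n_records:,} records for {n_entities:,} entities - grain looks clean")
```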
See validation_framework.md for detailed patterns
Loading Reference Files
When to load validation_framework.md:
- Implementing any validation check
- Need specific SQL/Python patterns
- Building custom validation logic
- First time using a validation category
When to load common_pitfalls.md:
- Investigating unexpected results
- Debugging calculation errors
- Learning from past mistakes
- Quick reference for specific error types
Quick Decision Tree
User says "validate this query/analysis"
↓
Load validation_framework.md
↓
├─ Simple query check → Section 1 (Query Correctness)
├─ Check consistency → Section 2 (Data Consistency)
├─ Multiple tables → Section 3 (Cross-Source)
├─ Missing data? → Section 4 (Null Detection)
├─ Experiment dates → Section 5 (Date Range Validation)
├─ Unusual patterns → Section 6 (Anomaly Detection)
├─ Mentions "for sale" / "buy leads" → Section 7 (Vertical Filter)
├─ Metric selection questions → Section 8 (Metric Calculation)
└─ Multiple records per entity → Section 9 (Aggregation Grain)
User says "why don't these numbers match" or "this seems wrong"
↓
Load common_pitfalls.md first
↓
Identify which pitfall pattern matches
↓
Apply specific fix from that section
↓
Then load validation_framework.md for comprehensive check
User mentions specific lead types ("for sale", "rental", "seller")
↓
Check Section 7: Vertical Filter Validation
↓
Verify query has appropriate vertical filter
↓
If missing, confirm with user which vertical(s) to include
User asks "should I use median or mean?" or "how to compare these rates?"
↓
Check Section 8: Metric Calculation Validation
↓
Analyze data distribution
↓
Recommend appropriate metric
User has unexpectedly high/low averages
↓
Check Section 9: Aggregation Grain Validation
↓
Verify records vs unique entities
↓
Confirm aggregation level
User asks "why don't contributions sum to 100%?"
↓
Check metric type (Section 2: Contribution Analysis)
↓
├─ MEDIAN/PERCENTILE → Explain gap is expected
└─ MEAN/SUM → Check for missing segments
Validation Checklist
Before Running Analysis:
- Date ranges match definition
- Periods are balanced (X*7 days if day-of-week patterns exist)
- Critical fields checked for nulls (informational note if high %)
- Query has proper GROUP BY and NULLIF
- Vertical filter present when user mentions lead type
- Metric choice appropriate for data distribution (median for skewed)
- Aggregation grain specified (per-record vs per-entity)
During Analysis:
- Summary totals = detail sums
- Percentages sum to ~100% (99.5-100.5% tolerance for rounding)
- Contribution analysis validated for metric type:
- Non-additive (median/percentile): Gap noted as expected
- Additive (mean/sum): Gap <5% or explained
- Anomalies flagged (informational)
- Cross-section consistency validated
- Null percentages noted in results (if significant)
- Rates normalized to same time scale
- No duplicate entity inflation
Before Publication:
- All queries reviewed for correctness
- Cross-source reconciliation complete
- Data quality issues documented
- Sensitivity analysis performed (for key findings)
- Metric calculations verified
- Aggregation level confirmed
Common Mistakes to Avoid
- Assuming data is correct - Always validate, even if results look reasonable
- Ignoring small discrepancies - 1% differences often indicate bigger issues
- Comparing unbalanced periods - Critical for time-series analysis
- Trusting single source - Cross-validate important metrics
- Skipping pre-checks - Easier to catch errors before analysis than after
- Using mean for skewed data - Price data almost always needs median
- Comparing unnormalized rates - Always use same time scale
- Ignoring duplicate records - Multiple records per entity distort averages
- Expecting median contributions to sum to 100% - Non-additive metrics have gaps
Integration with Other Skills
Use with real-estate-marketing-analytics:
- Validate marketing metrics before analysis
- Check campaign performance calculations
- Reconcile lead counts across systems
- Verify ROAS and EFR calculations
- Ensure proper vertical filtering
Use with statistical analysis:
- Validate experimental design
- Verify balanced control/treatment groups
- Validate statistical test prerequisites
- Check metric appropriateness
Use before creating deliverables:
- Final validation before reports
- Cross-check before presentations
- Verify before sharing with stakeholders
Recovery Pattern
When You Find an Error:
1. Stop and Document
   - What's wrong
   - How discovered
   - Correct approach
2. Assess Impact
   - Which sections affected
   - How much numbers change
   - Do conclusions change
3. Re-run Corrected Analysis
   - Update all affected queries
   - Validate consistency
   - Re-check everything
4. Prevent Recurrence
   - Add to validation checklist
   - Document in common_pitfalls.md
   - Create reusable validation query
Success Criteria
You're using this skill effectively when:
- Catching errors BEFORE stakeholders do
- Proactively validating instead of reactively debugging
- Building validation into standard workflow
- Documenting validation checks performed
- Explaining data quality caveats when presenting results
- Choosing appropriate metrics for data distribution
- Understanding aggregation grain in your queries
- Recognizing when contribution gaps are expected (non-additive metrics)