name	hypothesis-testing
description	Rigorous hypothesis testing process for data analysis - formulate hypotheses before looking at data, design tests, analyze systematically, interpret with skepticism

Hypothesis Testing Process

Overview

This skill guides you through rigorous hypothesis testing in data analysis. The core principle is scientific rigor: formulate your hypotheses BEFORE looking at the data to avoid p-hacking and confirmation bias.

Hypothesis testing is appropriate when:

You have a specific claim or belief to test
You want to avoid confirmation bias
The question can be framed as "Does X affect Y?"
You need defensible, reproducible conclusions

Prerequisites

Before using this skill, you MUST:

Have data imported into SQLite database using the importing-data skill
Have data quality validated and cleaned using the cleaning-data skill (MANDATORY - never skip)
Have created an analysis workspace (just start-analysis hypothesis-testing <name>)
Understand the basic structure of your data tables
Be familiar with the component skills:
- understanding-data - for data profiling
- writing-queries - for SQL query construction
- interpreting-results - for result analysis
- creating-visualizations - for text-based visualizations

Mandatory Process Structure

You MUST use TodoWrite to track progress through all 5 phases. Create todos at the start:

- Phase 1: Hypothesis Formulation (H0/H1) - pending
- Phase 2: Test Design - pending
- Phase 3: Data Analysis - pending
- Phase 4: Result Interpretation - pending
- Phase 5: Conclusion and Follow-up - pending

Update status as you progress. Mark phases complete ONLY after checkpoint verification.

Phase 1: Hypothesis Formulation

CHECKPOINT: Before proceeding, you MUST have:

Written null hypothesis (H0) - specific, testable statement
Written alternative hypothesis (H1) - specific, directional or non-directional
Documented WHY you're testing this (user goal, business context)
Saved to 01 - hypothesis-formulation.md

Instructions

Ask clarifying questions about the user's analytical goal
- What belief or claim needs testing?
- What would constitute evidence for/against this belief?
- What's the practical significance threshold?
Write hypotheses in 01 - hypothesis-formulation.md with: ./templates/phase-1.md
STOP and get user confirmation
- Read the hypotheses back to the user
- Confirm this is what they want to test
- Do NOT proceed to Phase 2 until confirmed

Common Rationalization: "I'll just peek at the data to refine the hypothesis" Reality: This creates confirmation bias. Formulate hypothesis FIRST, always.

Phase 2: Test Design

CHECKPOINT: Before proceeding, you MUST have:

Defined exact metrics to calculate
Specified comparison method (segments, time periods, etc.)
Listed required data checks (missing values, outliers, sample sizes)
Written test plan WITHOUT executing any queries yet
Saved to 02 - test-design.md

Instructions

Design your test in 02 - test-design.md with: ./templates/phase-2.md
STOP and verify test design
- Does this test actually answer the hypothesis?
- Are there simpler/better ways to test it?
- Have you planned for data quality issues?
- Get user confirmation before proceeding

Common Rationalization: "I'll just run one quick query to see if the data exists" Reality: That's looking at data before test design is complete. Finish design FIRST.

Phase 3: Data Analysis

CHECKPOINT: Before proceeding, you MUST have:

Executed schema/data quality checks
Documented any data issues found
Executed main analysis query
Executed supporting analysis queries
Saved all queries and results to numbered files (03-, 04-, etc.)

Instructions

Execute queries in order, one file per query

Create separate numbered files:

03 - schema-check.md
04 - data-quality-check.md
05 - main-analysis.md
06 - supporting-analysis.md (if needed)

For each query file, use this structure: ./templates/phase-3-query.md
Execute queries using appropriate tool
- Use SQLite CLI or database tool
- Copy EXACT results (don't summarize or round)
- If query fails, document error and revise
Handle data quality issues
- If quality checks reveal problems, document them
- Decide: exclude bad data, transform it, or note as limitation
- Update test design if needed (document why)

Common Rationalization: "I'll skip data quality checks since the data looks fine" Reality: ALWAYS check data quality. Surprises happen. Document what you checked.

Common Rationalization: "I'll combine all queries into one file to save time" Reality: Separate files create clear audit trail. One query, one file.

Phase 4: Result Interpretation

CHECKPOINT: Before proceeding, you MUST have:

Described what the data shows (facts only)
Calculated relevant comparisons (% differences, ratios)
Considered alternative explanations
Assessed confounding factors identified in Phase 1
Stated whether results support/reject H0
Saved to 07 - interpretation.md (or next number)

Instructions

Create interpretation file: XX - interpretation.md with: ./templates/phase-4.md
Be intellectually honest
- State limitations clearly
- Don't overstate conclusions
- Acknowledge uncertainty
- Suggest what additional data would help

Common Rationalization: "The pattern is obvious, I don't need to consider alternatives" Reality: Always consider alternatives. Obvious patterns often have surprising explanations.

Common Rationalization: "I'll downplay limitations so the conclusion looks stronger" Reality: Stating limitations INCREASES credibility. Be honest about uncertainty.

Phase 5: Conclusion and Follow-up

CHECKPOINT: Before proceeding, you MUST have:

Written clear, actionable conclusion
Listed 2-3 specific follow-up questions
Suggested concrete next steps
Saved to 08 - conclusion.md (or next number)
Updated 00 - overview.md with summary

Instructions

Create conclusion file: XX - conclusion.md with: ./templates/phase-5.md
Update overview file: 00 - overview.md

Add summary section with: ./templates/overview-summary.md

Final checklist before marking complete
- All phases documented in numbered files
- Queries and results included
- Limitations acknowledged
- Follow-up questions specified
- Overview updated
- User informed of conclusions

Common Rationalization: "I found the answer, I'm done" Reality: Good analysis always identifies the NEXT question. List follow-ups.

Common Rationalization: "I'll skip updating the overview since it's all in the detailed files" Reality: Overview provides navigational summary. Always update it.

Complete Example: Day-of-Week Analysis

Example Scenario

User wants to test: "Do we get more sales on weekends?"

Phase 1: Hypothesis Formulation (01 - hypothesis-formulation.md)

# Hypothesis Formulation

## Analytical Goal
Determine if weekend days (Saturday, Sunday) have higher sales than weekday days (Monday-Friday)

## Context
Business wants to optimize staffing and inventory. If weekends are significantly busier, we should staff up. If not, we can balance resources across the week.

## Hypotheses

### Null Hypothesis (H0)
Weekend days and weekday days have equal average sales. Any observed differences are due to random variation.

### Alternative Hypothesis (H1)
Weekend days have significantly different average sales compared to weekday days.

## Success Criteria
- Difference >25% would be practically meaningful (enough to justify staffing changes)
- Pattern should be consistent across multiple weeks

## Potential Confounds
- Holidays: Holiday Monday might inflate weekday averages
- Promotions: Weekend promotions might inflate weekend sales
- Store hours: Different hours on weekends might affect opportunity
- Seasonality: Analysis period might not be representative of full year

Phase 2: Test Design (02 - test-design.md)

# Test Design

## Metrics

### Primary Metric
Average daily sales amount: SUM(amount) / COUNT(DISTINCT date) for weekend vs weekday groups

### Supporting Metrics
- Total transaction count per day-of-week (to assess sample size)
- Median sales per transaction (to check if average is representative)
- Week-over-week consistency (to see if pattern is stable)

## Comparison Structure
Group days into two categories:
- Weekend: Saturday (day 6), Sunday (day 0)
- Weekday: Monday-Friday (days 1-5)

Calculate average daily sales for each group, compare the ratio.

## Data Requirements

### Required Tables/Columns
- Table: `sales` (or similar)
  - `transaction_date` (DATE or TEXT in ISO format)
  - `amount` (NUMERIC)

### Data Quality Checks
1. Check for NULL dates or amounts
2. Verify date range covers complete weeks
3. Confirm adequate sample size (>100 transactions per day-of-week)
4. Identify any outlier days (holidays, system outages)

### Queries Needed
1. Schema check: PRAGMA table_info(sales)
2. Data quality: NULL checks, date range, outlier detection
3. Main analysis: Daily sales by day-of-week
4. Supporting: Transaction counts, weekly pattern consistency

## Statistical Considerations
Look for differences >25% between weekend and weekday averages. Check if pattern is consistent across multiple weeks (not just one unusual weekend).

Phase 3: Data Analysis (03-06 files)

03 - schema-check.md:

# Schema Check

## Rationale
Verify the sales table exists and has the required columns for date and amount

## Query
```sql
PRAGMA table_info(sales);

Results

cid | name              | type    | notnull | dflt_value | pk
0   | transaction_id    | INTEGER | 0       | NULL       | 1
1   | transaction_date  | TEXT    | 0       | NULL       | 0
2   | amount            | REAL    | 0       | NULL       | 0
3   | product_id        | INTEGER | 0       | NULL       | 0

Initial Observations

Table sales exists
Has transaction_date column (TEXT type - will need STRFTIME to extract day-of-week)
Has amount column (REAL type - good for calculations)
Columns allow NULLs - need to check data quality


**04 - data-quality-check.md:**
```markdown
# Data Quality Check

## Rationale
Verify data completeness and identify any issues before main analysis

## Query
```sql
SELECT
  COUNT(*) as total_rows,
  COUNT(CASE WHEN transaction_date IS NULL THEN 1 END) as null_dates,
  COUNT(CASE WHEN amount IS NULL THEN 1 END) as null_amounts,
  MIN(transaction_date) as earliest_date,
  MAX(transaction_date) as latest_date,
  MIN(amount) as min_amount,
  MAX(amount) as max_amount
FROM sales;

Results

total_rows | null_dates | null_amounts | earliest_date | latest_date | min_amount | max_amount
15847      | 0          | 0            | 2024-01-01    | 2024-03-31  | 5.00       | 899.99

Initial Observations

15,847 total transactions
No NULL dates or amounts (clean data)
3 months of data (Jan-Mar 2024)
13 complete weeks (91 days / 7)
Amount range: $5 to $899.99 (no obvious outliers)


**05 - main-analysis.md:**
```markdown
# Main Analysis: Sales by Day of Week

## Rationale
Calculate average sales by day of week to test if weekend days differ from weekdays

## Query
```sql
SELECT
  CAST(STRFTIME('%w', transaction_date) AS INTEGER) as day_of_week,
  CASE
    WHEN CAST(STRFTIME('%w', transaction_date) AS INTEGER) IN (0, 6) THEN 'Weekend'
    ELSE 'Weekday'
  END as day_type,
  COUNT(DISTINCT transaction_date) as days_count,
  COUNT(*) as transaction_count,
  SUM(amount) as total_sales,
  ROUND(SUM(amount) / COUNT(DISTINCT transaction_date), 2) as avg_daily_sales,
  ROUND(AVG(amount), 2) as avg_transaction_amount
FROM sales
GROUP BY day_of_week, day_type
ORDER BY day_of_week;

Results

day_of_week | day_type | days_count | transaction_count | total_sales | avg_daily_sales | avg_transaction_amount
0           | Weekend  | 13         | 1456              | 52389.44    | 4030.00        | 35.98
1           | Weekday  | 13         | 2234              | 102345.67   | 7872.90        | 45.81
2           | Weekday  | 13         | 2198              | 98234.55    | 7556.50        | 44.70
3           | Weekday  | 13         | 2301              | 105678.90   | 8129.15        | 45.93
4           | Weekday  | 13         | 2345              | 108901.23   | 8377.02        | 46.44
5           | Weekday  | 13         | 2456              | 115432.11   | 8879.39        | 47.00
6           | Weekend  | 13         | 2857              | 98765.43    | 7597.34        | 34.56

Initial Observations

Sunday (0): $4,030 avg daily sales, 1,456 transactions, $35.98 avg transaction
Monday-Friday (1-5): $7,800-$8,800 avg daily sales, 2,200-2,500 transactions, $44-47 avg transaction
Saturday (6): $7,597 avg daily sales, 2,857 transactions, $34.56 avg transaction
Saturday has HIGH transaction count but LOW per-transaction amount
Sunday has both low transaction count AND low per-transaction amount


**06 - supporting-analysis.md:**
```markdown
# Supporting Analysis: Weekend vs Weekday Comparison

## Rationale
Aggregate weekend and weekday categories to test the hypothesis directly

## Query
```sql
SELECT
  CASE
    WHEN CAST(STRFTIME('%w', transaction_date) AS INTEGER) IN (0, 6) THEN 'Weekend'
    ELSE 'Weekday'
  END as day_type,
  COUNT(DISTINCT transaction_date) as days_count,
  COUNT(*) as transaction_count,
  SUM(amount) as total_sales,
  ROUND(SUM(amount) / COUNT(DISTINCT transaction_date), 2) as avg_daily_sales,
  ROUND(AVG(amount), 2) as avg_transaction_amount
FROM sales
GROUP BY day_type;

Results

day_type | days_count | transaction_count | total_sales | avg_daily_sales | avg_transaction_amount
Weekday  | 65         | 11534             | 530592.46   | 8163.00         | 46.01
Weekend  | 26         | 4313              | 151154.87   | 5813.65         | 35.04

Initial Observations

Weekday average: $8,163 per day
Weekend average: $5,814 per day
Difference: $2,349 (29% lower on weekends)
Weekend has 62% fewer transactions per day
Weekend transactions are 24% smaller on average


### Phase 4: Interpretation (07 - interpretation.md)

```markdown
# Result Interpretation

## Summary of Findings

### Primary Metric Results
Weekends have significantly LOWER sales than weekdays:
- Weekday avg: $8,163 per day
- Weekend avg: $5,814 per day
- Difference: 29% lower on weekends

This contradicts the original hypothesis that weekends would have higher sales.

### Statistical Assessment
- Magnitude: 29% difference is both statistically meaningful (large sample) and practically significant (>25% threshold)
- Consistency: Pattern holds across 13 weeks of data
- Practical importance: This magnitude justifies different staffing/inventory approaches

## Alternative Explanations

1. **Store hours:** Maybe stores close earlier on weekends
   - Evidence: Transaction count is 62% lower on weekends, suggesting operational constraint
   - Impact: Could fully explain lower daily totals if hours are reduced

2. **Customer behavior:** Maybe weekend shoppers buy smaller items
   - Evidence: Average transaction is 24% smaller on weekends ($35 vs $46)
   - Impact: Explains some but not all of the difference

3. **Product mix:** Maybe high-value products aren't available on weekends
   - Evidence: Would need product category data to verify
   - Impact: Unknown

4. **Sunday effect:** Maybe Sunday drags down weekend average
   - Evidence: Sunday has only $4,030 avg (47% lower than Saturday's $7,597)
   - Impact: If we exclude Sunday, Saturday is closer to weekdays but still 7% lower

## Hypothesis Test Result

### Null Hypothesis (H0)
Weekend days and weekday days have equal average sales.

### Decision
**REJECT H0**

### Rationale
The 29% difference between weekend and weekday average sales is large, consistent across 13 weeks, and exceeds our 25% practical significance threshold. The pattern is clear: weekends have lower sales than weekdays.

HOWEVER: This is the OPPOSITE of what we hypothesized. The alternative hypothesis stated "weekend days have significantly different sales" which is true, but we expected higher, not lower.

## Limitations

1. **Cannot determine causation:** Data shows weekends are lower but doesn't explain why
2. **Store operational factors unknown:** Don't have data on hours, staffing, inventory availability
3. **Sunday is dramatically lower:** Weekend average is heavily influenced by very low Sunday sales
4. **Seasonal effects:** 3 months (Jan-Mar) may not represent full year (could be post-holiday slump)
5. **No control for promotions/holidays:** Any holidays in the period could skew results

Phase 5: Conclusion (08 - conclusion.md)

# Conclusion and Follow-up

## Main Conclusion

Weekends have 29% lower average daily sales than weekdays, contradicting the initial hypothesis that weekends would be busier. This pattern is consistent across 13 weeks and appears driven by both fewer transactions (62% lower) and smaller average purchases (24% lower).

## Actionable Insights

**DO NOT increase weekend staffing based on this data.** The hypothesis that weekends are busier is not supported.

Instead:

1. **Investigate Sunday specifically:** With only $4,030 avg daily sales (vs $7,600 on Saturday), Sunday operations may not be profitable. Consider:
   - Reduced hours on Sunday
   - Sunday-specific promotions to drive traffic
   - Or closing on Sundays if fixed costs exceed contribution margin

2. **Understand operational constraints:** Before making staffing decisions, determine:
   - Are weekend hours reduced? (This might explain lower transaction counts)
   - Is weekend inventory limited? (This might explain smaller purchases)
   - Are certain high-value products unavailable on weekends?

3. **Focus weekday resources:** If pattern holds year-round, optimize for Monday-Friday peak demand

## Follow-up Questions

1. **What are actual store operating hours by day?**
   - Data needed: Store hours table, or time-of-day in transaction timestamps
   - Approach: Calculate sales per operating hour rather than per day
   - Why: Would distinguish "fewer hours" from "fewer customers per hour"

2. **Does the weekend pattern hold across all seasons?**
   - Data needed: Full year of transaction data
   - Approach: Repeat this analysis for each quarter
   - Why: Jan-Mar might be post-holiday slump; need to verify pattern is year-round

3. **Are there product category differences on weekends?**
   - Data needed: Product category or department in transaction records
   - Approach: Analyze category mix and average prices by day-of-week
   - Why: Would explain the 24% smaller average transaction on weekends

4. **How do weekends compare if we exclude Sunday?**
   - Data needed: Current dataset (already have this)
   - Approach: Re-run analysis treating Sunday separately
   - Why: Sunday is so dramatically lower it may be distorting the weekend average

## Confidence Level

**Medium confidence** in the finding that weekends are lower, but **low confidence** in the explanation.

Reasoning:
- Pattern is clear and consistent across 13 weeks (strengthens confidence)
- Sample sizes are adequate (4,313 weekend transactions is plenty)
- BUT: Cannot distinguish operational constraints from customer behavior (weakens confidence)
- AND: Only 3 months of data may not represent full year (weakens confidence)

**Next step:** Investigate store hours and expand to full year data before making operational changes.

Common Rationalizations

"I'll just look at the data first to understand it better"

Why this is wrong: Looking at data before formulating hypotheses creates confirmation bias. You'll unconsciously form hypotheses that match what you see, then "test" them. This isn't science.

Do instead: Formulate hypothesis from domain knowledge, business context, or theory. Then look at data.

"The hypothesis is obvious from the results, I don't need to write it down"

Why this is wrong: Hindsight bias makes everything seem obvious after you see the answer. Writing hypothesis first creates accountability.

Do instead: Always write H0 and H1 in Phase 1 before any query execution.

"I'll skip data quality checks since the data looks clean"

Why this is wrong: You can't know data is clean until you check. Surprises happen often.

Do instead: ALWAYS run data quality queries. Document what you checked and what you found (even if it's "no issues").

"The pattern is clear, I don't need to consider alternative explanations"

Why this is wrong: Obvious patterns often have non-obvious causes. Confounding factors are common.

Do instead: Always list 2-3 alternative explanations in Phase 4, even if you think they're unlikely.

"I'll combine multiple queries into one file for efficiency"

Why this is wrong: One file per query creates clear audit trail and makes analysis reproducible.

Do instead: One query, one file. Use consistent numbering (03-, 04-, etc.).

"I found the answer, analysis is complete"

Why this is wrong: Good analysis always identifies the next question. Every conclusion should raise new questions.

Do instead: Always list 2-3 follow-up questions in Phase 5.

"I'll downplay limitations so the conclusion looks stronger"

Why this is wrong: Acknowledging limitations increases credibility. Readers trust honest uncertainty more than false certainty.

Do instead: State limitations clearly. Be honest about what you don't know.

"I'll skip updating the overview since everything is in detailed files"

Why this is wrong: Overview provides navigation and quick summary. Future readers (including you) will thank you.

Do instead: Always update 00 - overview.md with results summary in Phase 5.

Summary

This skill ensures rigorous, reproducible hypothesis testing by:

Preventing confirmation bias: Formulate hypothesis BEFORE looking at data
Ensuring thoughtful design: Plan your test before executing queries
Creating audit trail: One query per file, with rationale and results
Demanding intellectual honesty: Consider alternatives, state limitations
Identifying next questions: Every analysis should suggest follow-up investigations

Follow this process and you'll produce defensible, credible analysis that stands up to scrutiny.

Install Skill

SKILL.md

Hypothesis Testing Process

Overview

Prerequisites

Mandatory Process Structure

Phase 1: Hypothesis Formulation

Instructions

Phase 2: Test Design

Instructions

Phase 3: Data Analysis

Instructions

Phase 4: Result Interpretation

Instructions

Phase 5: Conclusion and Follow-up

Instructions

Complete Example: Day-of-Week Analysis

Example Scenario

Phase 1: Hypothesis Formulation (01 - hypothesis-formulation.md)

Phase 2: Test Design (02 - test-design.md)

Phase 3: Data Analysis (03-06 files)

Results

Initial Observations

Results

Initial Observations

Results

Initial Observations

Results

Initial Observations

Phase 5: Conclusion (08 - conclusion.md)

Common Rationalizations

"I'll just look at the data first to understand it better"

"The hypothesis is obvious from the results, I don't need to write it down"

"I'll skip data quality checks since the data looks clean"

"The pattern is clear, I don't need to consider alternative explanations"

"I'll combine multiple queries into one file for efficiency"

"I found the answer, analysis is complete"

"I'll downplay limitations so the conclusion looks stronger"

"I'll skip updating the overview since everything is in detailed files"

Summary