| name | experiment-design-methodology |
| description | This skill should be used when designing A/B tests, feature experiments, or data-driven product decisions - covers hypothesis formation, experiment design, statistical significance, sample size calculation, and coordination with experiment-tracker agent for rigorous experimentation. |
Experiment Design Methodology
Overview
Design rigorous experiments that produce actionable insights. Avoid common pitfalls such as stopping tests early, changing multiple variables at once, or making decisions without statistical significance.
Core principle: Strong opinions, weakly held. Validate with data, not intuition.
When to Use
Use when:
- Testing new features or changes
- Optimizing conversion rates
- Making product decisions
- Comparing alternatives
- Validating hypotheses
- Planning A/B tests
The Experiment Framework
1. Form Hypothesis
Template: "If we [change], then [metric] will [increase/decrease] by [amount] because [reasoning]"
Examples:
✅ "If we reduce the signup form from 5 fields to 2, then completion rate will increase by 20% because there is less friction"
✅ "If we add social proof above the CTA, then click-through rate will increase by 15% because trust signals build confidence"
❌ "Let's try a different button color" (no hypothesis)
Good hypothesis characteristics:
- Specific change
- Measurable outcome
- Predicted magnitude
- Clear reasoning
2. Define Success Metric
Primary metric (one only):
- The key outcome you care about
- What constitutes "success"
Examples:
- Signup conversion rate
- Revenue per user
- Feature adoption rate
- Retention day 7
Secondary metrics (track but don't optimize for):
- Related metrics that might be affected
- Help understand trade-offs
Guardrail metrics (must not degrade):
- Revenue
- User satisfaction
- Key user flows
3. Calculate Sample Size
Inputs needed:
- Baseline conversion rate
- Minimum detectable effect
- Statistical significance (usually 95%)
- Statistical power (usually 80%)
Example calculation:
Baseline: 20% signup rate
MDE: 2 percentage points (looking for 20% → 22%)
Significance: 95%
Power: 80%
Needed: ~6,500 users per variant
(Use an online calculator such as evanmiller.org/ab-testing/sample-size.html, or see the sketch below)
Runtime estimate:
Users per day: 500
Users needed per variant: ~6,500
Variants: 2 (A and B)
Total needed: ~13,000 users
Days to run: ~26 days minimum
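For teams that prefer to compute this in code, here is a minimal sketch of the standard two-proportion sample size formula (normal approximation, two-sided test). The function name and hard-coded z-values are illustrative; online calculators use a slightly different variance term, so their results may differ by a few percent.

```js
// Approximate sample size per variant for a two-proportion test.
// zAlpha = 1.96 (95% confidence, two-sided), zBeta = 0.8416 (80% power).
function sampleSizePerVariant(baseline, mde, zAlpha = 1.96, zBeta = 0.8416) {
  const p1 = baseline
  const p2 = baseline + mde
  const variance = p1 * (1 - p1) + p2 * (1 - p2)
  return Math.ceil(((zAlpha + zBeta) ** 2 * variance) / (mde ** 2))
}

const perVariant = sampleSizePerVariant(0.20, 0.02) // ≈ 6,500
const totalUsers = perVariant * 2                   // two variants (A and B)
const daysToRun = Math.ceil(totalUsers / 500)       // ≈ 26 days at 500 users/day
```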
4. Design Test
A/B test structure:
- Control (A): Current version
- Variant (B): One change only
Critical rules:
- Change ONE thing only
- 50/50 traffic split
- Random assignment
- Run until statistical significance
- Don't peek early (increases false positives)
5. Implementation
Feature flags pattern:
```jsx
// Simple feature flag: deterministic 50/50 split on user ID
const variant = userId % 2 === 0 ? 'A' : 'B'

// Track which variant the user saw
trackEvent('experiment_view', {
  experimentId: 'signup_form_length',
  variant: variant
})

// Render the appropriate version
if (variant === 'A') {
  return <LongSignupForm />
} else {
  return <ShortSignupForm />
}
```
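The modulo split above is simple, but it places every user in the same bucket for every experiment and only supports a 50/50 split. A common alternative, sketched below under assumed names (hashToUnitInterval and assignVariant are illustrative, not from any specific library), is to hash the experiment ID together with the user ID:

```js
// Deterministic per (experiment, user), independent across experiments,
// and supports any number of variants.
function hashToUnitInterval(str) {
  // 32-bit FNV-1a hash mapped to [0, 1); production systems often use murmur or sha.
  let h = 0x811c9dc5
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i)
    h = Math.imul(h, 0x01000193)
  }
  return (h >>> 0) / 2 ** 32
}

function assignVariant(experimentId, userId, variants = ['A', 'B']) {
  const bucket = hashToUnitInterval(`${experimentId}:${userId}`)
  return variants[Math.floor(bucket * variants.length)]
}

const variant = assignVariant('signup_form_length', userId)
```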
Using PostHog (recommended):
```jsx
import { useFeatureFlagVariantKey } from 'posthog-js/react'

function SignupForm() {
  // PostHog assigns and persists the variant key (e.g. 'control' or 'test') per user
  const variant = useFeatureFlagVariantKey('signup_form_length')
  return variant === 'control' ? <LongSignupForm /> : <ShortSignupForm />
}
```
6. Run Experiment
Don't stop until:
- Statistical significance reached (p < 0.05)
- Minimum sample size achieved
- At least 1 week elapsed (account for day-of-week effects)
- At least 1 full business cycle (if B2B)
Stopping early = false positives
7. Analyze Results
Calculate statistical significance:
Use a chi-square or two-proportion z-test for conversion rates, a t-test for continuous metrics such as revenue per user
p-value < 0.05 = statistically significant (a minimal sketch follows below)
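Below is a minimal sketch of a two-proportion z-test (equivalent to a 2×2 chi-square test) in plain JavaScript. The conversion counts at the bottom are hypothetical, and a statistics library would normally replace the hand-rolled normal CDF approximation.

```js
// Two-proportion z-test, two-sided.
function twoProportionZTest(convA, nA, convB, nB) {
  const pA = convA / nA
  const pB = convB / nB
  const pPooled = (convA + convB) / (nA + nB)
  const se = Math.sqrt(pPooled * (1 - pPooled) * (1 / nA + 1 / nB))
  const z = (pB - pA) / se
  const pValue = 2 * (1 - standardNormalCdf(Math.abs(z)))
  return { pA, pB, z, pValue }
}

// Abramowitz-Stegun approximation of the standard normal CDF (for x >= 0).
function standardNormalCdf(x) {
  const t = 1 / (1 + 0.2316419 * x)
  const d = Math.exp(-x * x / 2) / Math.sqrt(2 * Math.PI)
  const poly = t * (0.319381530 + t * (-0.356563782 +
    t * (1.781477937 + t * (-1.821255978 + t * 1.330274429))))
  return 1 - d * poly
}

// Hypothetical counts: 1,300 of 6,500 converted in A, 1,430 of 6,500 in B.
const result = twoProportionZTest(1300, 6500, 1430, 6500)
// result.pValue ≈ 0.005 → statistically significant at p < 0.05
```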
Check for:
- Practical significance (is lift worth it?)
- Consistency across segments
- Impact on guardrail metrics
- Novelty effects (will it last?)
8. Make Decision
Ship winner if:
- ✅ Statistically significant (p < 0.05)
- ✅ Practically significant (meaningful improvement)
- ✅ No negative impact on guardrails
- ✅ Consistent across key segments
Keep testing if:
- ⚠️ Inconclusive results
- ⚠️ Mixed segment performance
- ⚠️ Questions about long-term impact
Roll back if:
- ❌ Negative impact
- ❌ Worse than control
- ❌ Breaks guardrail metrics
Common Experiment Types
A/B Test (2 variants)
Use for:
- Testing one clear alternative
- Simple yes/no decisions
Example:
- Control: Long signup form
- Variant: Short signup form
A/B/C Test (3+ variants)
Use for:
- Testing multiple alternatives
- Finding optimal value (e.g., price points)
Example:
- Control: $9.99/month
- Variant B: $14.99/month
- Variant C: $19.99/month
Warning: Needs more traffic (split 3+ ways)
Multivariate Test
Use for:
- Testing combinations of changes
- Understanding interaction effects
Example: Test button color × button text:
- Red + "Sign Up" = Variant A
- Red + "Get Started" = Variant B
- Blue + "Sign Up" = Variant C
- Blue + "Get Started" = Variant D
Warning: Needs 4x traffic of A/B test
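One way to implement the 2×2 assignment above (a sketch only, reusing the illustrative assignVariant helper from the implementation section and the trackEvent call from the earlier example) is to split each factor independently, so each of the four combinations receives roughly 25% of traffic:

```js
// Each factor gets its own independent 50/50 split, giving ~25% per combination.
const color = assignVariant('cta_color', userId, ['red', 'blue'])
const text = assignVariant('cta_text', userId, ['Sign Up', 'Get Started'])

trackEvent('experiment_view', {
  experimentId: 'cta_color_x_text',
  variant: `${color} + ${text}` // e.g. "red + Sign Up"
})
```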
Sequential Testing
Use for:
- Iterative improvements
- Building on learnings
Pattern:
Week 1: Test headline variations → Winner
Week 2: Test CTA copy (using winning headline) → Winner
Week 3: Test social proof placement (using winners) → Winner
Experiment Design Checklist
Before launching:
- Clear hypothesis written
- Primary metric defined
- Sample size calculated
- Runtime estimated (can we wait this long?)
- Implementation plan ready
- Tracking instrumented
- Randomization tested
- No external changes planned (confounding variables)
During experiment:
- Monitor daily (but don't stop early)
- Check for bugs/implementation issues
- Verify equal traffic split
- Track guardrail metrics
After experiment:
- Statistical significance calculated
- Practical significance assessed
- Segment analysis done
- Decision documented
- Learnings recorded
Statistical Concepts
Statistical Significance (p-value)
What it means:
- p < 0.05: if there were truly no difference, a result at least this extreme would occur less than 5% of the time
- p < 0.01: less than 1% of the time (stronger evidence)
Not a guarantee:
- False positives still happen (about 5% of the time at p < 0.05 when there is no real effect)
- An adequate sample size is still required
- Running many tests, or checking many metrics, inflates the false positive rate
Confidence Intervals
Example: "Variant B improved conversion by 2.3% (95% CI: 1.2% to 3.4%)"
Interpretation:
- Best estimate: 2.3% improvement
- 95% confident true value is between 1.2% and 3.4%
- Lower bound (1.2%) > 0, so we're confident there's an improvement
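A minimal sketch of how such an interval can be computed for the difference between two conversion rates (a simple Wald interval; the counts are hypothetical and chosen to roughly reproduce the example above):

```js
// 95% confidence interval for the difference in conversion rates (Wald interval).
function diffConfidenceInterval(convA, nA, convB, nB, z = 1.96) {
  const pA = convA / nA
  const pB = convB / nB
  const diff = pB - pA
  const se = Math.sqrt((pA * (1 - pA)) / nA + (pB * (1 - pB)) / nB)
  return { diff, lower: diff - z * se, upper: diff + z * se }
}

const ci = diffConfidenceInterval(2000, 10000, 2230, 10000)
// ci.diff ≈ 0.023, ci.lower ≈ 0.012, ci.upper ≈ 0.034 (i.e. 1.2% to 3.4%)
```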
Sample Size
Factors:
- Baseline conversion rate
- Minimum detectable effect (smaller = more samples needed)
- Statistical significance (95% standard)
- Statistical power (80% standard)
Rule of thumb: Need ~385 conversions per variant for:
- 95% significance
- 80% power
- Detecting 20% relative improvement
Common Experiment Mistakes
| Mistake | Impact | Prevention |
|---|---|---|
| Peeking early | False positives | Wait for full sample size |
| Testing multiple things | Can't isolate cause | One variable per test |
| No hypothesis | Unclear what to measure | Write hypothesis first |
| Sample too small | Inconclusive | Calculate required size |
| Stopping too soon | Day-of-week effects | Run full week minimum |
| Ignoring guardrails | Win metric but hurt business | Track all key metrics |
| Not accounting for seasonality | Confounding variables | Compare to same period |
Experiment Coordination with Agents
Use experiment-tracker agent for:
- Setting up experiments
- Tracking results over time
- Calculating statistical significance
- Making roll-out decisions
Use analytics-reporter agent for:
- Understanding baseline metrics
- Segment analysis
- Funnel visualization
- Historical trends
Example workflow:
1. Form hypothesis
2. @experiment-tracker set up A/B test for [hypothesis]
3. @experiment-tracker check results (after 1 week)
4. @analytics-reporter analyze segments for [experiment]
5. Make data-driven decision
Quick Decisions Without Full Experiments
When to skip formal A/B testing:
- Obvious improvements (fixing bugs, removing friction)
- Time-sensitive opportunities
- Very low traffic (<100 users/day)
- Qualitative improvements
- Compliance/legal requirements
Use instead:
- Gradual rollout (5% → 25% → 50% → 100%)
- Monitor metrics closely
- Quick revert if issues
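A minimal sketch of the gradual rollout pattern above (the feature key and percentages are illustrative; it reuses the hashToUnitInterval helper sketched earlier, and tools like PostHog offer rollout percentages as a built-in feature flag setting):

```jsx
// Deterministically expose a percentage of users to the new behavior.
const ROLLOUT_PERCENT = 25 // bump 5 → 25 → 50 → 100 as metrics stay healthy

function isRolledOut(featureKey, userId, percent = ROLLOUT_PERCENT) {
  return hashToUnitInterval(`${featureKey}:${userId}`) * 100 < percent
}

return isRolledOut('short_signup_form', userId)
  ? <ShortSignupForm />
  : <LongSignupForm />
```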
Resources
- Sample size calculator: evanmiller.org/ab-testing/sample-size.html
- Stats significance calculator: abtestguide.com/calc
- PostHog (feature flags + experiments): posthog.com
Experimentation is how good products become great products. Test rigorously, decide confidently.