| name | experiment-design-methodology |
| description | This skill should be used when designing A/B tests, feature experiments, or data-driven product decisions - covers hypothesis formation, experiment design, statistical significance, sample size calculation, and coordination with experiment-tracker agent for rigorous experimentation. |
Experiment Design Methodology
Overview
Design rigorous experiments that produce actionable insights. Avoid common pitfalls such as stopping tests early, changing multiple variables at once, or making decisions without statistical significance.
Core principle: Strong opinions, weakly held. Validate with data, not intuition.
When to Use
Use when:
- Testing new features or changes
- Optimizing conversion rates
- Making product decisions
- Comparing alternatives
- Validating hypotheses
- Planning A/B tests
The Experiment Framework
1. Form Hypothesis
Template: "If we [change], then [metric] will [increase/decrease] by [amount] because [reasoning]"
Examples:
✅ "If we reduce the signup form from 5 fields to 2, then completion rate will increase by 20% because there is less friction"
✅ "If we add social proof above the CTA, then click-through rate will increase by 15% because trust signals build confidence"
❌ "Let's try a different button color" (no hypothesis)
Good hypothesis characteristics:
- Specific change
- Measurable outcome
- Predicted magnitude
- Clear reasoning
2. Define Success Metric
Primary metric (one only):
- The key outcome you care about
- What constitutes "success"
Examples:
- Signup conversion rate
- Revenue per user
- Feature adoption rate
- Retention day 7
Secondary metrics (track but don't optimize for):
- Related metrics that might be affected
- Help understand trade-offs
Guardrail metrics (must not degrade):
- Revenue
- User satisfaction
- Key user flows
3. Calculate Sample Size
Inputs needed:
- Baseline conversion rate
- Minimum detectable effect
- Statistical significance (usually 95%)
- Statistical power (usually 80%)
Example calculation:
Baseline: 20% signup rate
MDE: 2 percentage points (looking for 20% → 22%)
Significance: 95%
Power: 80%
Needed: ~6,500 users per variant
(Use an online calculator such as evanmiller.org/ab-testing/sample-size.html, or see the sketch below)
Runtime estimate:
Users per day: 500
Users needed per variant: ~6,500
Variants: 2 (A and B)
Total needed: ~13,000 users
Days to run: ~26 days minimum
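For teams that prefer to compute this in code, here is a minimal sketch of the standard two-proportion sample size formula (normal approximation, two-sided test). The function name and hard-coded z-values are illustrative; online calculators use a slightly different variance term, so their results may differ by a few percent.

```js
// Approximate sample size per variant for a two-proportion test.
// zAlpha = 1.96 (95% confidence, two-sided), zBeta = 0.8416 (80% power).
function sampleSizePerVariant(baseline, mde, zAlpha = 1.96, zBeta = 0.8416) {
  const p1 = baseline
  const p2 = baseline + mde
  const variance = p1 * (1 - p1) + p2 * (1 - p2)
  return Math.ceil(((zAlpha + zBeta) ** 2 * variance) / (mde ** 2))
}

const perVariant = sampleSizePerVariant(0.20, 0.02) // ≈ 6,500
const totalUsers = perVariant * 2                   // two variants (A and B)
const daysToRun = Math.ceil(totalUsers / 500)       // ≈ 26 days at 500 users/day
```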
4. Design Test
A/B test structure:
- Control (A): Current version
- Variant (B): One change only
Critical rules:
- Change ONE thing only
- 50/50 traffic split
- Random assignment
- Run until statistical significance
- Don't peek early (increases false positives)
5. Implementation
Feature flags pattern:
```jsx
// Simple feature flag: deterministic 50/50 split on user ID
const variant = userId % 2 === 0 ? 'A' : 'B'

// Track which variant the user saw
trackEvent('experiment_view', {
  experimentId: 'signup_form_length',
  variant: variant
})

// Render the appropriate version
if (variant === 'A') {
  return <LongSignupForm />
} else {
  return <ShortSignupForm />
}
```
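The modulo split above is simple, but it places every user in the same bucket for every experiment and only supports a 50/50 split. A common alternative, sketched below under assumed names (hashToUnitInterval and assignVariant are illustrative, not from any specific library), is to hash the experiment ID together with the user ID:

```js
// Deterministic per (experiment, user), independent across experiments,
// and supports any number of variants.
function hashToUnitInterval(str) {
  // 32-bit FNV-1a hash mapped to [0, 1); production systems often use murmur or sha.
  let h = 0x811c9dc5
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i)
    h = Math.imul(h, 0x01000193)
  }
  return (h >>> 0) / 2 ** 32
}

function assignVariant(experimentId, userId, variants = ['A', 'B']) {
  const bucket = hashToUnitInterval(`${experimentId}:${userId}`)
  return variants[Math.floor(bucket * variants.length)]
}

const variant = assignVariant('signup_form_length', userId)
```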
Using PostHog (recommended):
```jsx
import { useFeatureFlagVariantKey } from 'posthog-js/react'

function SignupForm() {
  // PostHog assigns and persists the variant key (e.g. 'control' or 'test') per user
  const variant = useFeatureFlagVariantKey('signup_form_length')
  return variant === 'control' ? <LongSignupForm /> : <ShortSignupForm />
}
```
6. Run Experiment
Don't stop until:
- Statistical significance reached (p < 0.05)
- Minimum sample size achieved
- At least 1 week elapsed (account for day-of-week effects)
- At least 1 full business cycle (if B2B)
Stopping early = false positives
7. Analyze Results
Calculate statistical significance:
Use a chi-square or two-proportion z-test for conversion rates, a t-test for continuous metrics such as revenue per user
p-value < 0.05 = statistically significant (a minimal sketch follows below)
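Below is a minimal sketch of a two-proportion z-test (equivalent to a 2×2 chi-square test) in plain JavaScript. The conversion counts at the bottom are hypothetical, and a statistics library would normally replace the hand-rolled normal CDF approximation.

```js
// Two-proportion z-test, two-sided.
function twoProportionZTest(convA, nA, convB, nB) {
  const pA = convA / nA
  const pB = convB / nB
  const pPooled = (convA + convB) / (nA + nB)
  const se = Math.sqrt(pPooled * (1 - pPooled) * (1 / nA + 1 / nB))
  const z = (pB - pA) / se
  const pValue = 2 * (1 - standardNormalCdf(Math.abs(z)))
  return { pA, pB, z, pValue }
}

// Abramowitz-Stegun approximation of the standard normal CDF (for x >= 0).
function standardNormalCdf(x) {
  const t = 1 / (1 + 0.2316419 * x)
  const d = Math.exp(-x * x / 2) / Math.sqrt(2 * Math.PI)
  const poly = t * (0.319381530 + t * (-0.356563782 +
    t * (1.781477937 + t * (-1.821255978 + t * 1.330274429))))
  return 1 - d * poly
}

// Hypothetical counts: 1,300 of 6,500 converted in A, 1,430 of 6,500 in B.
const result = twoProportionZTest(1300, 6500, 1430, 6500)
// result.pValue ≈ 0.005 → statistically significant at p < 0.05
```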
Check for:
- Practical significance (is lift worth it?)
- Consistency across segments
- Impact on guardrail metrics
- Novelty effects (will it last?)
8. Make Decision
Ship winner if:
- ✅ Statistically significant (p < 0.05)
- ✅ Practically significant (meaningful improvement)
- ✅ No negative impact on guardrails
- ✅ Consistent across key segments
Keep testing if:
- ⚠️ Inconclusive results
- ⚠️ Mixed segment performance
- ⚠️ Questions about long-term impact
Roll back if:
- ❌ Negative impact
- ❌ Worse than control
- ❌ Breaks guardrail metrics
Common Experiment Types
A/B Test (2 variants)
Use for:
- Testing one clear alternative
- Simple yes/no decisions
Example:
- Control: Long signup form
- Variant: Short signup form
A/B/C Test (3+ variants)
Use for:
- Testing multiple alternatives
- Finding optimal value (e.g., price points)
Example:
- Control: $9.99/month
- Variant B: $14.99/month
- Variant C: $19.99/month
Warning: Needs more traffic (split 3+ ways)
Multivariate Test
Use for:
- Testing combinations of changes
- Understanding interaction effects
Example: Test button color × button text:
- Red + "Sign Up" = Variant A
- Red + "Get Started" = Variant B
- Blue + "Sign Up" = Variant C
- Blue + "Get Started" = Variant D
Warning: Needs 4x traffic of A/B test
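One way to implement the 2×2 assignment above (a sketch only, reusing the illustrative assignVariant helper from the implementation section and the trackEvent call from the earlier example) is to split each factor independently, so each of the four combinations receives roughly 25% of traffic:

```js
// Each factor gets its own independent 50/50 split, giving ~25% per combination.
const color = assignVariant('cta_color', userId, ['red', 'blue'])
const text = assignVariant('cta_text', userId, ['Sign Up', 'Get Started'])

trackEvent('experiment_view', {
  experimentId: 'cta_color_x_text',
  variant: `${color} + ${text}` // e.g. "red + Sign Up"
})
```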
Sequential Testing
Use for:
- Iterative improvements
- Building on learnings
Pattern:
Week 1: Test headline variations → Winner
Week 2: Test CTA copy (using winning headline) → Winner
Week 3: Test social proof placement (using winners) → Winner
Experiment Design Checklist
Before launching:
- Clear hypothesis written
- Primary metric defined
- Sample size calculated
- Runtime estimated (can we wait this long?)
- Implementation plan ready
- Tracking instrumented
- Randomization tested
- No external changes planned (confounding variables)
During experiment:
- Monitor daily (but don't stop early)
- Check for bugs/implementation issues
- Verify equal traffic split
- Track guardrail metrics
After experiment:
- Statistical significance calculated
- Practical significance assessed
- Segment analysis done
- Decision documented
- Learnings recorded
Statistical Concepts
Statistical Significance (p-value)
What it means:
- p < 0.05: if there were truly no difference, a result at least this extreme would occur less than 5% of the time
- p < 0.01: less than 1% of the time (stronger evidence)
Not a guarantee:
- False positives still happen (about 5% of the time at p < 0.05 when there is no real effect)
- An adequate sample size is still required
- Running many tests, or checking many metrics, inflates the false positive rate
Confidence Intervals
Example: "Variant B improved conversion by 2.3% (95% CI: 1.2% to 3.4%)"
Interpretation:
- Best estimate: 2.3% improvement
- 95% confident true value is between 1.2% and 3.4%
- Lower bound (1.2%) > 0, so we're confident there's an improvement
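A minimal sketch of how such an interval can be computed for the difference between two conversion rates (a simple Wald interval; the counts are hypothetical and chosen to roughly reproduce the example above):

```js
// 95% confidence interval for the difference in conversion rates (Wald interval).
function diffConfidenceInterval(convA, nA, convB, nB, z = 1.96) {
  const pA = convA / nA
  const pB = convB / nB
  const diff = pB - pA
  const se = Math.sqrt((pA * (1 - pA)) / nA + (pB * (1 - pB)) / nB)
  return { diff, lower: diff - z * se, upper: diff + z * se }
}

const ci = diffConfidenceInterval(2000, 10000, 2230, 10000)
// ci.diff ≈ 0.023, ci.lower ≈ 0.012, ci.upper ≈ 0.034 (i.e. 1.2% to 3.4%)
```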
Sample Size
Factors:
- Baseline conversion rate
- Minimum detectable effect (smaller = more samples needed)
- Statistical significance (95% standard)
- Statistical power (80% standard)
Rule of thumb: Need ~385 conversions per variant for:
- 95% significance
- 80% power
- Detecting 20% relative improvement
Common Experiment Mistakes
| Mistake | Impact | Prevention |
|---|---|---|
| Peeking early | False positives | Wait for full sample size |
| Testing multiple things | Can't isolate cause | One variable per test |
| No hypothesis | Unclear what to measure | Write hypothesis first |
| Sample too small | Inconclusive | Calculate required size |
| Stopping too soon | Day-of-week effects | Run full week minimum |
| Ignoring guardrails | Win metric but hurt business | Track all key metrics |
| Not accounting for seasonality | Confounding variables | Compare to same period |
Experiment Coordination with Agents
Use experiment-tracker agent for:
- Setting up experiments
- Tracking results over time
- Calculating statistical significance
- Making roll-out decisions
Use analytics-reporter agent for:
- Understanding baseline metrics
- Segment analysis
- Funnel visualization
- Historical trends
Example workflow:
1. Form hypothesis
2. @experiment-tracker set up A/B test for [hypothesis]
3. @experiment-tracker check results (after 1 week)
4. @analytics-reporter analyze segments for [experiment]
5. Make data-driven decision
Quick Decisions Without Full Experiments
When to skip formal A/B testing:
- Obvious improvements (fixing bugs, removing friction)
- Time-sensitive opportunities
- Very low traffic (<100 users/day)
- Qualitative improvements
- Compliance/legal requirements
Use instead:
- Gradual rollout (5% → 25% → 50% → 100%)
- Monitor metrics closely
- Quick revert if issues
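A minimal sketch of the gradual rollout pattern above (the feature key and percentages are illustrative; it reuses the hashToUnitInterval helper sketched earlier, and tools like PostHog offer rollout percentages as a built-in feature flag setting):

```jsx
// Deterministically expose a percentage of users to the new behavior.
const ROLLOUT_PERCENT = 25 // bump 5 → 25 → 50 → 100 as metrics stay healthy

function isRolledOut(featureKey, userId, percent = ROLLOUT_PERCENT) {
  return hashToUnitInterval(`${featureKey}:${userId}`) * 100 < percent
}

return isRolledOut('short_signup_form', userId)
  ? <ShortSignupForm />
  : <LongSignupForm />
```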
Resources
- Sample size calculator: evanmiller.org/ab-testing/sample-size.html
- Stats significance calculator: abtestguide.com/calc
- PostHog (feature flags + experiments): posthog.com
Experimentation is how good products become great products. Test rigorously, decide confidently.