Claude Code Plugins

Community-maintained marketplace


experiment-design-methodology

@chriscarterux/chris-claude-stack

This skill should be used when designing A/B tests, feature experiments, or data-driven product decisions; it covers hypothesis formation, experiment design, statistical significance, sample size calculation, and coordination with the experiment-tracker agent for rigorous experimentation.

Install Skill

1. Download skill
2. Enable skills in Claude: open claude.ai/settings/capabilities and find the "Skills" section
3. Upload to Claude: click "Upload skill" and select the downloaded ZIP file

Note: Please verify the skill by reviewing its instructions before using it.

SKILL.md

name: experiment-design-methodology
description: This skill should be used when designing A/B tests, feature experiments, or data-driven product decisions; it covers hypothesis formation, experiment design, statistical significance, sample size calculation, and coordination with the experiment-tracker agent for rigorous experimentation.

Experiment Design Methodology

Overview

Design rigorous experiments that produce actionable insights. Avoid common pitfalls like stopping tests early, testing multiple variables, or making decisions without statistical significance.

Core principle: Strong opinions, weakly held. Validate with data, not intuition.

When to Use

Use when:

  • Testing new features or changes
  • Optimizing conversion rates
  • Making product decisions
  • Comparing alternatives
  • Validating hypotheses
  • Planning A/B tests

The Experiment Framework

1. Form Hypothesis

Template: "If we [change], then [metric] will [increase/decrease] by [amount] because [reasoning]"

Examples:

✅ "If we reduce the signup form from 5 fields to 2, then completion rate will increase by 20% because there is less friction"
✅ "If we add social proof above the CTA, then click-through rate will increase by 15% because trust signals build confidence"
❌ "Let's try a different button color" (no hypothesis)

Good hypothesis characteristics:

  • Specific change
  • Measurable outcome
  • Predicted magnitude
  • Clear reasoning
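
The template can also be written down as a small record so each hypothesis is captured before launch and reviewed against the results afterwards. A minimal sketch (the interface and field names are illustrative, not part of any tracking tool):

```typescript
// One way to record a hypothesis before launch
interface ExperimentHypothesis {
  change: string                      // "Reduce signup form from 5 fields to 2"
  metric: string                      // "Signup completion rate"
  direction: 'increase' | 'decrease'  // predicted direction
  expectedLift: string                // "+20% relative"
  reasoning: string                   // "Less friction to complete the form"
}

const signupFormHypothesis: ExperimentHypothesis = {
  change: 'Reduce signup form from 5 fields to 2',
  metric: 'Signup completion rate',
  direction: 'increase',
  expectedLift: '+20% relative',
  reasoning: 'Less friction to complete the form',
}
```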

2. Define Success Metric

Primary metric (one only):

  • The key outcome you care about
  • What constitutes "success"

Examples:

  • Signup conversion rate
  • Revenue per user
  • Feature adoption rate
  • Retention day 7

Secondary metrics (track but don't optimize for):

  • Related metrics that might be affected
  • Help understand trade-offs

Guardrail metrics (must not degrade):

  • Revenue
  • User satisfaction
  • Key user flows

3. Calculate Sample Size

Inputs needed:

  • Baseline conversion rate
  • Minimum detectable effect
  • Statistical significance (usually 95%)
  • Statistical power (usually 80%)

Example calculation:

Baseline: 20% signup rate
MDE: 4 percentage points, i.e. a 20% relative lift (looking for 20% → 24%)
Significance: 95%
Power: 80%

Needed: ~2,000 users per variant
(Use online calculator: evanmiller.org/ab-testing/sample-size.html)

Runtime estimate:

Users per day: 500
Users needed per variant: 2,000
Variants: 2 (A and B)

Total needed: 4,000 users
Days to run: 8 days minimum
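
For a back-of-the-envelope number without opening the calculator, the standard two-proportion approximation can be sketched directly. This is a simplified estimate (function names are illustrative, and dedicated calculators differ slightly); it lands around 1,700 users per variant for the example above, and the rounded ~2,000 figure adds a margin of safety:

```typescript
// Approximate users needed per variant for a two-sided test at
// 95% confidence (z ≈ 1.96) and 80% power (z ≈ 0.84)
function sampleSizePerVariant(baselineRate: number, targetRate: number): number {
  const zAlpha = 1.96 // 95% confidence, two-sided
  const zBeta = 0.84  // 80% power
  const variance =
    baselineRate * (1 - baselineRate) + targetRate * (1 - targetRate)
  const delta = targetRate - baselineRate
  return Math.ceil(((zAlpha + zBeta) ** 2 * variance) / (delta * delta))
}

// Example above: 20% baseline, looking for 24%
const perVariant = sampleSizePerVariant(0.20, 0.24) // ≈ 1,700 users per variant
const totalUsers = perVariant * 2                    // two variants (A and B)
const daysToRun = Math.ceil(totalUsers / 500)        // at 500 users/day
```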

4. Design Test

A/B test structure:

  • Control (A): Current version
  • Variant (B): One change only

Critical rules:

  • Change ONE thing only
  • 50/50 traffic split
  • Random assignment
  • Run until statistical significance
  • Don't peek early (increases false positives)

5. Implementation

Feature flags pattern:

// Simple feature flag: deterministic 50/50 split on a numeric user ID
// (in production, hash the ID so assignment looks random but stays stable per user)
const variant = userId % 2 === 0 ? 'A' : 'B'

// Track variant
trackEvent('experiment_view', {
  experimentId: 'signup_form_length',
  variant: variant
})

// Render appropriate version
if (variant === 'A') {
  return <LongSignupForm />
} else {
  return <ShortSignupForm />
}

Using PostHog (recommended):

import { useFeatureFlagVariantKey } from 'posthog-js/react'

const variant = useFeatureFlagVariantKey('signup_form_length')

return variant === 'control'
  ? <LongSignupForm />
  : <ShortSignupForm />

6. Run Experiment

Don't stop until:

  • Statistical significance reached (p < 0.05)
  • Minimum sample size achieved
  • At least 1 week elapsed (account for day-of-week effects)
  • At least 1 full business cycle (if B2B)

Stopping early = false positives

7. Analyze Results

Calculate statistical significance:

Use a chi-square or two-proportion z-test for conversion rates, or a t-test for continuous metrics (e.g. revenue per user)
p-value < 0.05 = statistically significant
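
For the conversion-rate case, a minimal two-proportion z-test can be sketched in a few lines (function names and the example counts below are illustrative, not from a specific stats library):

```typescript
// Two-sided two-proportion z-test: did variant B convert differently from A?
function twoProportionZTest(
  conversionsA: number, usersA: number,
  conversionsB: number, usersB: number
): { z: number; pValue: number } {
  const pA = conversionsA / usersA
  const pB = conversionsB / usersB
  // Pooled rate under the null hypothesis of "no difference"
  const pooled = (conversionsA + conversionsB) / (usersA + usersB)
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / usersA + 1 / usersB))
  const z = (pB - pA) / se
  // Two-sided p-value from the standard normal distribution
  const pValue = 2 * (1 - standardNormalCdf(Math.abs(z)))
  return { z, pValue }
}

// Standard normal CDF (Abramowitz-Stegun polynomial approximation)
function standardNormalCdf(x: number): number {
  const t = 1 / (1 + 0.2316419 * Math.abs(x))
  const d = 0.3989423 * Math.exp((-x * x) / 2)
  const p =
    d * t * (0.3193815 + t * (-0.3565638 + t * (1.7814779 + t * (-1.8212560 + t * 1.3302744))))
  return x > 0 ? 1 - p : p
}

// Example: 400/2,000 signups (20%) vs 460/2,000 (23%)
const result = twoProportionZTest(400, 2000, 460, 2000)
// result.pValue < 0.05 → statistically significant
```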

Check for:

  • Practical significance (is lift worth it?)
  • Consistency across segments
  • Impact on guardrail metrics
  • Novelty effects (will it last?)

8. Make Decision

Ship winner if:

  • ✅ Statistically significant (p < 0.05)
  • ✅ Practically significant (meaningful improvement)
  • ✅ No negative impact on guardrails
  • ✅ Consistent across key segments

Keep testing if:

  • ⚠️ Inconclusive results
  • ⚠️ Mixed segment performance
  • ⚠️ Questions about long-term impact

Roll back if:

  • ❌ Negative impact
  • ❌ Worse than control
  • ❌ Breaks guardrail metrics

Common Experiment Types

A/B Test (2 variants)

Use for:

  • Testing one clear alternative
  • Simple yes/no decisions

Example:

  • Control: Long signup form
  • Variant: Short signup form

A/B/C Test (3+ variants)

Use for:

  • Testing multiple alternatives
  • Finding optimal value (e.g., price points)

Example:

  • Control: $9.99/month
  • Variant B: $14.99/month
  • Variant C: $19.99/month

Warning: Needs more traffic (split 3+ ways)

Multivariate Test

Use for:

  • Testing combinations of changes
  • Understanding interaction effects

Example: Test button color × button text:

  • Red + "Sign Up" = Variant A
  • Red + "Get Started" = Variant B
  • Blue + "Sign Up" = Variant C
  • Blue + "Get Started" = Variant D

Warning: Needs 4x traffic of A/B test
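
Assignment to the four cells can follow the same deterministic pattern as the simple feature flag earlier. A sketch (the modulo-based split and names are illustrative; a real setup would use hashed IDs or a feature-flag tool):

```typescript
const colors = ['red', 'blue'] as const
const labels = ['Sign Up', 'Get Started'] as const

// Deterministically map a numeric user ID to one of the 2 × 2 = 4 cells
function assignMultivariate(userId: number) {
  const cell = userId % 4
  return {
    color: colors[Math.floor(cell / 2)], // cells 0 and 1 → red, 2 and 3 → blue
    label: labels[cell % 2],             // even cell → "Sign Up", odd → "Get Started"
  }
}

// cell 0 = Variant A, 1 = B, 2 = C, 3 = D in the list above
```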

Sequential Testing

Use for:

  • Iterative improvements
  • Building on learnings

Pattern:

Week 1: Test headline variations → Winner
Week 2: Test CTA copy (using winning headline) → Winner
Week 3: Test social proof placement (using winners) → Winner

Experiment Design Checklist

Before launching:

  • Clear hypothesis written
  • Primary metric defined
  • Sample size calculated
  • Runtime estimated (can we wait this long?)
  • Implementation plan ready
  • Tracking instrumented
  • Randomization tested
  • No external changes planned (confounding variables)

During experiment:

  • Monitor daily (but don't stop early)
  • Check for bugs/implementation issues
  • Verify equal traffic split
  • Track guardrail metrics

After experiment:

  • Statistical significance calculated
  • Practical significance assessed
  • Segment analysis done
  • Decision documented
  • Learnings recorded

Statistical Concepts

Statistical Significance (p-value)

What it means:

  • p < 0.05: less than a 5% chance of seeing a difference this large if there were no real effect
  • p < 0.01: less than a 1% chance (stronger evidence)

Not a guarantee:

  • False positives happen (5% with p < 0.05)
  • Need adequate sample size
  • Multiple tests increase false positive rate

Confidence Intervals

Example: "Variant B improved conversion by 2.3% (95% CI: 1.2% to 3.4%)"

Interpretation:

  • Best estimate: 2.3% improvement
  • 95% confident true value is between 1.2% and 3.4%
  • Lower bound (1.2%) > 0, so we're confident there's an improvement
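
The interval itself comes from the normal approximation to the difference of two proportions. A minimal sketch (function name and counts are illustrative):

```typescript
// 95% CI for the absolute difference in conversion rates (variant minus control)
function differenceCI95(
  conversionsA: number, usersA: number,
  conversionsB: number, usersB: number
): { diff: number; lower: number; upper: number } {
  const pA = conversionsA / usersA
  const pB = conversionsB / usersB
  const se = Math.sqrt((pA * (1 - pA)) / usersA + (pB * (1 - pB)) / usersB)
  const diff = pB - pA
  const margin = 1.96 * se // 1.96 = z for a 95% interval
  return { diff, lower: diff - margin, upper: diff + margin }
}

// If lower > 0, the whole interval sits above zero and the lift is likely real
const ci = differenceCI95(400, 2000, 460, 2000)
```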

Sample Size

Factors:

  • Baseline conversion rate
  • Minimum detectable effect (smaller = more samples needed)
  • Statistical significance (95% standard)
  • Statistical power (80% standard)

Rule of thumb: Need ~385 conversions per variant for:

  • 95% significance
  • 80% power
  • Detecting 20% relative improvement

At a 20% baseline, ~385 conversions corresponds to roughly 1,900 to 2,000 users per variant, which matches the example calculation above.

Common Experiment Mistakes

| Mistake | Impact | Prevention |
| --- | --- | --- |
| Peeking early | False positives | Wait for full sample size |
| Testing multiple things | Can't isolate cause | One variable per test |
| No hypothesis | Unclear what to measure | Write hypothesis first |
| Sample too small | Inconclusive | Calculate required size |
| Stopping too soon | Day-of-week effects | Run full week minimum |
| Ignoring guardrails | Win metric but hurt business | Track all key metrics |
| Not accounting for seasonality | Confounding variables | Compare to same period |

Experiment Coordination with Agents

Use experiment-tracker agent for:

  • Setting up experiments
  • Tracking results over time
  • Calculating statistical significance
  • Making roll-out decisions

Use analytics-reporter agent for:

  • Understanding baseline metrics
  • Segment analysis
  • Funnel visualization
  • Historical trends

Example workflow:

1. Form hypothesis
2. @experiment-tracker set up A/B test for [hypothesis]
3. @experiment-tracker check results (after 1 week)
4. @analytics-reporter analyze segments for [experiment]
5. Make data-driven decision

Quick Decisions Without Full Experiments

When to skip formal A/B testing:

  • Obvious improvements (fixing bugs, removing friction)
  • Time-sensitive opportunities
  • Very low traffic (<100 users/day)
  • Qualitative improvements
  • Compliance/legal requirements

Use instead:

  • Gradual rollout (5% → 25% → 50% → 100%)
  • Monitor metrics closely
  • Quick revert if issues
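
A gradual rollout only needs a stable percentage gate. A minimal sketch (illustrative helper, assuming numeric user IDs; feature-flag tools like PostHog provide this built in):

```typescript
// Deterministic percentage gate: the same user stays in as the percentage grows
function isInRollout(userId: number, rolloutPercent: number): boolean {
  return userId % 100 < rolloutPercent
}

// Example: user 1042 is outside a 25% rollout but inside a 50% rollout
isInRollout(1042, 25) // false (1042 % 100 = 42)
isInRollout(1042, 50) // true
```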

Resources

  • Sample size calculator: evanmiller.org/ab-testing/sample-size.html
  • Stats significance calculator: abtestguide.com/calc
  • PostHog (feature flags + experiments): posthog.com

Experimentation is how good products become great products. Test rigorously, decide confidently.