| name | testing-in-production |
| description | Use when implementing feature flags, canary deployments, shadow traffic, A/B testing, choosing blast radius limits, defining rollback criteria, or monitoring production experiments - provides technique selection, anti-patterns, and kill switch frameworks |
Testing in Production
Overview
Core principle: Minimize blast radius, maximize observability, always have a kill switch.
Rule: Testing in production is safe when you control exposure and can roll back instantly.
Regulated industries (healthcare, finance, government): Production testing is still possible but requires additional controls - compliance review before experiments, audit trails for flag changes, avoiding PHI/PII in logs, Business Associate Agreements for third-party tools, and restricted techniques (shadow traffic may create prohibited data copies). Consult compliance team before first production test.
Technique Selection Decision Tree
| Your Goal | Risk Tolerance | Infrastructure Needed | Use |
|---|---|---|---|
| Test feature with specific users | Low | Feature flag service | Feature Flags |
| Validate deployment safety | Medium | Load balancer, multiple instances | Canary Deployment |
| Compare old vs new performance | Low | Traffic duplication | Shadow Traffic |
| Measure business impact | Medium | A/B testing framework, analytics | A/B Testing |
| Test without any user impact | Lowest | Service mesh, traffic mirroring | Dark Launch |
First technique: Feature flags (lowest infrastructure requirement, highest control)
Anti-Patterns Catalog
❌ Nested Feature Flags
Symptom: Flags controlling other flags, creating combinatorial complexity
Why bad: 2^N combinations to test, impossible to validate all paths, technical debt accumulates
Fix: Maximum 1 level of flag nesting, delete flags after rollout
```python
# ❌ Bad
if feature_flags.enabled("new_checkout"):
    if feature_flags.enabled("express_shipping"):
        if feature_flags.enabled("gift_wrap"):
            ...  # 8 possible combinations for 3 flags

# ✅ Good
if feature_flags.enabled("new_checkout_v2"):  # Single flag for full feature
    return new_checkout_with_all_options()
```
❌ Canary Without Sticky Routing
Symptom: Users bounce between old and new versions across requests because canary traffic is split per request instead of per user
Why bad: Inconsistent experience, state corruption, false negative metrics
Fix: Route each user to the same version for their entire session
```nginx
# ✅ Good - Consistent routing
upstream backend {
    hash $cookie_user_id consistent;  # Sticky by user ID
    server backend-v1:8080 weight=95;
    server backend-v2:8080 weight=5;
}
```
❌ No Statistical Validation
Symptom: Making rollout decisions on small sample sizes without confidence intervals
Why bad: Random variance mistaken for real effects, premature rollback or expansion
Fix: Minimum sample size, statistical significance testing
```python
# ✅ Good - Statistical validation
from statsmodels.stats.proportion import proportions_ztest

def is_safe_to_rollout(control_errors, treatment_errors, min_sample=1000):
    if len(treatment_errors) < min_sample:
        return False, "Insufficient data"
    # Two-proportion z-test on error counts vs. request counts
    _, p_value = proportions_ztest(
        [control_errors.sum(), treatment_errors.sum()],
        [len(control_errors), len(treatment_errors)],
    )
    # p > 0.05: no statistically significant increase in errors detected
    return p_value > 0.05, f"p-value: {p_value}"
```
❌ Testing Without Rollback
Symptom: Deploying feature flags or canaries without instant kill switch
Why bad: When issues detected, can't stop impact immediately
Fix: Kill switch tested before first production test
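As a rough illustration of what "instant kill switch" can mean in code, the sketch below uses an environment-variable override plus a flag check; the names (`KILL_NEW_CHECKOUT`, `flag_enabled`, the checkout functions) are placeholders, not part of any specific SDK:

```python
import os

def new_checkout(user_id: str): ...   # new code path under test (placeholder)
def old_checkout(user_id: str): ...   # known-good path, always kept deployable (placeholder)

def flag_enabled(name: str, user_id: str) -> bool:
    # Placeholder for your feature-flag client; it should fail closed
    # (return False) if the flag service is unreachable.
    return False

def checkout(user_id: str):
    # The env-var override is the kill switch: it forces the known-good path
    # even if the flag service itself is down or misconfigured.
    if os.environ.get("KILL_NEW_CHECKOUT") == "1":
        return old_checkout(user_id)
    if flag_enabled("new_checkout_v2", user_id):
        return new_checkout(user_id)
    return old_checkout(user_id)
```

Whatever mechanism you choose, exercise it once before the first real experiment so you know how long it actually takes to take effect.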
❌ Insufficient Monitoring
Symptom: Monitoring only error rates, missing business/user metrics
Why bad: Technical success but business failure (e.g., lower conversion)
Fix: Monitor technical + business + user experience metrics
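One lightweight way to keep all three visible is to tag every emitted metric with the experiment variant so dashboards can compare control vs. treatment directly. This is a sketch with placeholder names (`emit_metric`, `run_checkout`), not tied to any particular metrics client:

```python
import time

def run_checkout(user_id: str, variant: str):
    ...  # placeholder for the actual business logic

def emit_metric(name: str, value: float, variant: str) -> None:
    # Placeholder transport; swap in StatsD, Prometheus, DataDog, etc.
    print(f"{name} variant={variant} value={value}")

def handle_checkout(user_id: str, variant: str):
    start = time.monotonic()
    try:
        order = run_checkout(user_id, variant)
        emit_metric("checkout.completed", 1, variant)   # business metric
        return order
    except Exception:
        emit_metric("checkout.error", 1, variant)       # technical metric
        raise
    finally:
        emit_metric("checkout.latency_ms", (time.monotonic() - start) * 1000, variant)  # user experience
```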
Blast Radius Control Framework
Progressive rollout schedule:
| Phase | Exposure | Duration | Abort If | Continue If |
|---|---|---|---|---|
| 1. Internal | 10-50 internal users | 1-2 days | Any errors | 0 errors, good UX feedback |
| 2. Canary | 1% production traffic | 4-24 hours | Error rate > +2%, latency > +10% | Metrics stable |
| 3. Small | 5% production | 1-2 days | Error rate > +5%, latency > +25% | Metrics stable or improved |
| 4. Medium | 25% production | 2-3 days | Error rate > +5%, latency > +25% | Metrics stable or improved |
| 5. Majority | 50% production | 3-7 days | Error rate > +5%, business metrics down | Metrics improved |
| 6. Full | 100% production | Monitor indefinitely | Business metrics drop | Cleanup old code |
Minimum dwell time: Each phase needs a minimum observation period to catch delayed issues
Rollback at any phase: If metrics degrade, revert to previous phase
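The schedule above can be encoded as data so automation (or a runbook script) enforces dwell times and abort thresholds consistently. A sketch under assumed structure, with the internal phase omitted because it is allowlist-based rather than percentage-based:

```python
from dataclasses import dataclass

@dataclass
class Phase:
    name: str
    traffic_pct: float        # share of production traffic exposed
    min_dwell_hours: int      # minimum observation time before advancing
    max_error_delta: float    # abort if error rate rises more than this vs. baseline
    max_latency_delta: float  # abort if p99 latency rises more than this vs. baseline

ROLLOUT = [
    Phase("canary",    1,  4, 0.02, 0.10),
    Phase("small",     5, 24, 0.05, 0.25),
    Phase("medium",   25, 48, 0.05, 0.25),
    Phase("majority", 50, 72, 0.05, 0.25),
    Phase("full",    100,  0, 0.05, 0.25),
]

def may_advance(phase: Phase, hours_in_phase: float,
                error_delta: float, latency_delta: float) -> bool:
    # Advance only after the dwell time has passed and no abort threshold is hit.
    return (hours_in_phase >= phase.min_dwell_hours
            and error_delta <= phase.max_error_delta
            and latency_delta <= phase.max_latency_delta)
```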
Kill Switch Criteria
Immediate rollback triggers (automated):
| Metric | Threshold | Why |
|---|---|---|
| Error rate increase | > 5% above baseline | User impact |
| p99 latency increase | > 50% above baseline | Performance degradation |
| Critical errors (5xx) | > 0.1% of requests | Service failure |
| Business metric drop | > 10% (conversion, revenue) | Revenue impact |
Warning triggers (manual investigation):
| Metric | Threshold | Action |
|---|---|---|
| Error rate increase | 2-5% above baseline | Halt rollout, investigate |
| p95 latency increase | 25-50% above baseline | Monitor closely |
| User complaints | >3 similar reports | Halt rollout, investigate |
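A sketch of how the immediate-rollback thresholds above could be evaluated automatically. The metric names and dict interface are placeholders, and "+5% above baseline" is interpreted as a relative increase; adjust if your team means percentage points:

```python
def should_rollback(current: dict, baseline: dict) -> bool:
    """Return True if any immediate-rollback threshold is breached."""
    return (
        current["error_rate"] > baseline["error_rate"] * 1.05    # error rate > +5% vs baseline
        or current["p99_ms"] > baseline["p99_ms"] * 1.5          # p99 latency > +50% vs baseline
        or current["rate_5xx"] > 0.001                           # 5xx on > 0.1% of requests
        or current["conversion"] < baseline["conversion"] * 0.9  # business metric down > 10%
    )
```

On a True result, automation should flip the flag off or shift traffic back to the old version first, then page a human to investigate.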
Statistical validation:
- Plan sample size for 95% confidence and 80% power
- Minimum 1000 samples per variant for most A/B tests
- For low-traffic features: wait 24-48 hours regardless
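To derive a sample size rather than defaulting to 1000, a standard two-proportion power calculation works. This sketch uses statsmodels; the 2.0% vs. 2.5% error rates are example numbers, not recommendations:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Smallest difference you care about: 2.0% baseline vs 2.5% canary error rate
effect = proportion_effectsize(0.020, 0.025)

# Samples needed per variant for alpha=0.05 (95% confidence) and 80% power
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.8, ratio=1.0
)
print(round(n_per_variant))  # several thousand per variant for these example rates
```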
Monitoring Quick Reference
Required metrics (all tests):
| Category | Metrics | Alert Threshold |
|---|---|---|
| Errors | Error rate, exception count, 5xx responses | > +5% vs baseline |
| Performance | p50/p95/p99 latency, request duration | p99 > +50% vs baseline |
| Business | Conversion rate, transaction completion, revenue | > -10% vs baseline |
| User Experience | Client errors, page load, bounce rate | > +20% vs baseline |
Baseline calculation:
```python
import numpy as np

# Collect baseline from previous 7-14 days
baseline_p99 = np.percentile(historical_latencies, 99)
current_p99 = np.percentile(current_latencies, 99)

if current_p99 > baseline_p99 * 1.5:  # 50% increase
    rollback()
```
Implementation Patterns
Feature Flags Pattern
```python
# Using the LaunchDarkly Python SDK (Split.io, Flagsmith, etc. follow the same pattern)
import ldclient
from ldclient import Context
from ldclient.config import Config

ldclient.set_config(Config("sdk-key"))
client = ldclient.get()

def handle_request(user_id):
    context = Context.builder(user_id).kind("user").build()
    if client.variation("new-checkout", context, False):
        return new_checkout_flow(user_id)
    else:
        return old_checkout_flow(user_id)
```
Best practices:
- Default to `False` (old behavior) for safety
- Pass user context for targeting
- Log flag evaluations for debugging
- Delete flags within 30 days of full rollout
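Because "log flag evaluations" is the practice most often skipped, a thin wrapper can make every decision traceable. A sketch reusing the `client` and `Context` from the pattern above; the log format is arbitrary:

```python
import logging

logger = logging.getLogger("feature_flags")

def evaluate_flag(flag_key: str, context, default: bool = False) -> bool:
    # Log every evaluation so "why did this user see the new behavior?"
    # can be answered from logs later.
    value = client.variation(flag_key, context, default)
    logger.info("flag=%s context=%s value=%s", flag_key, context.key, value)
    return value
```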
Canary Deployment Pattern
```yaml
# Kubernetes with Istio
# (assumes a DestinationRule that defines subsets v1 and v2 for my-service)
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: my-service
spec:
  hosts:
    - my-service
  http:
    - match:
        - headers:
            x-canary:
              exact: "true"
      route:
        - destination:
            host: my-service
            subset: v2
    - route:
        - destination:
            host: my-service
            subset: v1
          weight: 95
        - destination:
            host: my-service
            subset: v2
          weight: 5
```
Shadow Traffic Pattern
```python
# Duplicate requests to new service, ignore responses
import asyncio

async def handle_request(request):
    # Primary: serve user from old service
    response = await old_service(request)
    # Shadow: send to new service, don't wait
    asyncio.create_task(new_service(request.copy()))  # Fire and forget
    return response  # User sees old service response
```
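If the goal is performance comparison, the fire-and-forget task can also time the shadow call and record the outcome. A sketch that swallows shadow failures so they can never reach the user (here `new_service` is the same placeholder as above):

```python
import asyncio
import logging
import time

logger = logging.getLogger("shadow")

async def shadow_call(request):
    # Time the new service and record the result; never let failures propagate.
    start = time.monotonic()
    try:
        await new_service(request)
        logger.info("shadow ok latency_ms=%.1f", (time.monotonic() - start) * 1000)
    except Exception:
        logger.exception("shadow request failed (user unaffected)")
```

The primary handler would then create the task with `asyncio.create_task(shadow_call(request.copy()))` instead of calling `new_service` directly.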
Tool Ecosystem Quick Reference
| Tool Category | Options | When to Use |
|---|---|---|
| Feature Flags | LaunchDarkly, Split.io, Flagsmith, Unleash | User-level targeting, instant rollback |
| Canary/Blue-Green | Istio, Linkerd, AWS App Mesh, Flagger | Service mesh, traffic shifting |
| A/B Testing | Optimizely, VWO, Google Optimize | Business metric validation |
| Observability | DataDog, New Relic, Honeycomb, Grafana | Metrics, traces, logs correlation |
| Statistical Analysis | Statsig, Eppo, GrowthBook | Automated significance testing |
Recommendation for starting: Feature flags (Flagsmith for self-hosted, LaunchDarkly for SaaS) + existing observability
Your First Production Test
Goal: Safely test a small feature with feature flags
Week 1: Setup
Choose feature flag tool
- Self-hosted: Flagsmith (free, open source)
- SaaS: LaunchDarkly (free tier: 1000 MAU)
Instrument code
```python
if feature_flags.enabled("my-first-test", user_id):
    return new_feature(user_id)
else:
    return old_feature(user_id)
```
Set up monitoring
- Error rate dashboard
- Latency percentiles (p50, p95, p99)
- Business metric (conversion, completion rate)
Define rollback criteria
- Error rate > +5%
- p99 latency > +50%
- Business metric < -10%
Week 2: Test Execution
Day 1-2: Internal users (10 people)
- Enable flag for 10 employee user IDs
- Monitor for errors, gather feedback
Day 3-5: Canary (1% of users)
- Enable for 1% random sample
- Monitor metrics every hour
- Rollback if any threshold exceeded
Day 6-8: Small rollout (5%)
- If canary successful, increase to 5%
- Continue monitoring
Day 9-14: Full rollout (100%)
- Gradual increase: 25% → 50% → 100%
- Monitor for 7 days at 100%
Week 3: Cleanup
- Remove flag from code
- Archive flag in dashboard
- Document learnings
Common Mistakes
❌ Expanding Rollout Too Fast
Fix: Follow minimum dwell times (24 hours per phase)
❌ Monitoring Only After Issues
Fix: Dashboard ready before first rollout, alerts configured
❌ No Rollback Practice
Fix: Test rollback in staging before production
❌ Ignoring Business Metrics
Fix: Technical metrics AND business metrics required for go/no-go decisions
Quick Reference
Technique Selection:
- User-specific: Feature flags
- Deployment safety: Canary
- Performance comparison: Shadow traffic
- Business validation: A/B testing
Blast Radius Progression: Internal → 1% → 5% → 25% → 50% → 100%
Kill Switch Thresholds:
- Error rate: > +5%
- p99 latency: > +50%
- Business metrics: > -10%
Minimum Sample Sizes:
- A/B test: 1000 samples per variant
- Canary: 24 hours observation
Tool Recommendations:
- Feature flags: LaunchDarkly, Flagsmith
- Canary: Istio, Flagger
- Observability: DataDog, Grafana
Bottom Line
Production testing is safe with three controls: exposure limits, observability, instant rollback.
Start with feature flags, use progressive rollout (1% → 5% → 25% → 50% → 100%), monitor technical and business metrics, and always have a kill switch.