| name | testing-in-production |
| description | Use when implementing feature flags, canary deployments, shadow traffic, A/B testing, choosing blast radius limits, defining rollback criteria, or monitoring production experiments - provides technique selection, anti-patterns, and kill switch frameworks |
Testing in Production
Overview
Core principle: Minimize blast radius, maximize observability, always have a kill switch.
Rule: Testing in production is safe when you control exposure and can roll back instantly.
Regulated industries (healthcare, finance, government): Production testing is still possible but requires additional controls - compliance review before experiments, audit trails for flag changes, avoiding PHI/PII in logs, Business Associate Agreements for third-party tools, and restricted techniques (shadow traffic may create prohibited data copies). Consult compliance team before first production test.
Technique Selection Decision Tree
| Your Goal | Risk Tolerance | Infrastructure Needed | Use |
|---|---|---|---|
| Test feature with specific users | Low | Feature flag service | Feature Flags |
| Validate deployment safety | Medium | Load balancer, multiple instances | Canary Deployment |
| Compare old vs new performance | Low | Traffic duplication | Shadow Traffic |
| Measure business impact | Medium | A/B testing framework, analytics | A/B Testing |
| Test without any user impact | Lowest | Service mesh, traffic mirroring | Dark Launch |
First technique: Feature flags (lowest infrastructure requirement, highest control)
Anti-Patterns Catalog
❌ Nested Feature Flags
Symptom: Flags controlling other flags, creating combinatorial complexity
Why bad: 2^N combinations to test, impossible to validate all paths, technical debt accumulates
Fix: Maximum 1 level of flag nesting, delete flags after rollout
```python
# ❌ Bad
if feature_flags.enabled("new_checkout"):
    if feature_flags.enabled("express_shipping"):
        if feature_flags.enabled("gift_wrap"):
            ...  # 8 possible combinations for 3 flags

# ✅ Good
if feature_flags.enabled("new_checkout_v2"):  # Single flag for full feature
    return new_checkout_with_all_options()
```
❌ Canary Without Sticky Routing
Symptom: Users bounce between old and new versions across requests because canary traffic is split per request instead of per user
Why bad: Inconsistent experience, state corruption, false negative metrics
Fix: Route each user to the same version for their entire session
```nginx
# ✅ Good - Consistent routing
upstream backend {
    hash $cookie_user_id consistent;  # Sticky by user ID
    server backend-v1:8080 weight=95;
    server backend-v2:8080 weight=5;
}
```
❌ No Statistical Validation
Symptom: Making rollout decisions on small sample sizes without confidence intervals
Why bad: Random variance mistaken for real effects, premature rollback or expansion
Fix: Minimum sample size, statistical significance testing
```python
# ✅ Good - Statistical validation
from statsmodels.stats.proportion import proportions_ztest

def is_safe_to_rollout(control_errors, treatment_errors, min_sample=1000):
    if len(treatment_errors) < min_sample:
        return False, "Insufficient data"
    # Two-proportion z-test on error counts vs. request counts
    _, p_value = proportions_ztest(
        [control_errors.sum(), treatment_errors.sum()],
        [len(control_errors), len(treatment_errors)],
    )
    # p > 0.05: no statistically significant increase in errors detected
    return p_value > 0.05, f"p-value: {p_value}"
```
❌ Testing Without Rollback
Symptom: Deploying feature flags or canaries without instant kill switch
Why bad: When issues detected, can't stop impact immediately
Fix: Kill switch tested before first production test
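As a rough illustration of what "instant kill switch" can mean in code, the sketch below uses an environment-variable override plus a flag check; the names (`KILL_NEW_CHECKOUT`, `flag_enabled`, the checkout functions) are placeholders, not part of any specific SDK:

```python
import os

def new_checkout(user_id: str): ...   # new code path under test (placeholder)
def old_checkout(user_id: str): ...   # known-good path, always kept deployable (placeholder)

def flag_enabled(name: str, user_id: str) -> bool:
    # Placeholder for your feature-flag client; it should fail closed
    # (return False) if the flag service is unreachable.
    return False

def checkout(user_id: str):
    # The env-var override is the kill switch: it forces the known-good path
    # even if the flag service itself is down or misconfigured.
    if os.environ.get("KILL_NEW_CHECKOUT") == "1":
        return old_checkout(user_id)
    if flag_enabled("new_checkout_v2", user_id):
        return new_checkout(user_id)
    return old_checkout(user_id)
```

Whatever mechanism you choose, exercise it once before the first real experiment so you know how long it actually takes to take effect.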
❌ Insufficient Monitoring
Symptom: Monitoring only error rates, missing business/user metrics
Why bad: Technical success but business failure (e.g., lower conversion)
Fix: Monitor technical + business + user experience metrics
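One lightweight way to keep all three visible is to tag every emitted metric with the experiment variant so dashboards can compare control vs. treatment directly. This is a sketch with placeholder names (`emit_metric`, `run_checkout`), not tied to any particular metrics client:

```python
import time

def run_checkout(user_id: str, variant: str):
    ...  # placeholder for the actual business logic

def emit_metric(name: str, value: float, variant: str) -> None:
    # Placeholder transport; swap in StatsD, Prometheus, DataDog, etc.
    print(f"{name} variant={variant} value={value}")

def handle_checkout(user_id: str, variant: str):
    start = time.monotonic()
    try:
        order = run_checkout(user_id, variant)
        emit_metric("checkout.completed", 1, variant)   # business metric
        return order
    except Exception:
        emit_metric("checkout.error", 1, variant)       # technical metric
        raise
    finally:
        emit_metric("checkout.latency_ms", (time.monotonic() - start) * 1000, variant)  # user experience
```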
Blast Radius Control Framework
Progressive rollout schedule:
| Phase | Exposure | Duration | Abort If | Continue If |
|---|---|---|---|---|
| 1. Internal | 10-50 internal users | 1-2 days | Any errors | 0 errors, good UX feedback |
| 2. Canary | 1% production traffic | 4-24 hours | Error rate > +2%, latency > +10% | Metrics stable |
| 3. Small | 5% production | 1-2 days | Error rate > +5%, latency > +25% | Metrics stable or improved |
| 4. Medium | 25% production | 2-3 days | Error rate > +5%, latency > +25% | Metrics stable or improved |
| 5. Majority | 50% production | 3-7 days | Error rate > +5%, business metrics down | Metrics improved |
| 6. Full | 100% production | Monitor indefinitely | Business metrics drop | Cleanup old code |
Minimum dwell time: Each phase needs a minimum observation period to catch delayed issues
Rollback at any phase: If metrics degrade, revert to previous phase
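The schedule above can be encoded as data so automation (or a runbook script) enforces dwell times and abort thresholds consistently. A sketch under assumed structure, with the internal phase omitted because it is allowlist-based rather than percentage-based:

```python
from dataclasses import dataclass

@dataclass
class Phase:
    name: str
    traffic_pct: float        # share of production traffic exposed
    min_dwell_hours: int      # minimum observation time before advancing
    max_error_delta: float    # abort if error rate rises more than this vs. baseline
    max_latency_delta: float  # abort if p99 latency rises more than this vs. baseline

ROLLOUT = [
    Phase("canary",    1,  4, 0.02, 0.10),
    Phase("small",     5, 24, 0.05, 0.25),
    Phase("medium",   25, 48, 0.05, 0.25),
    Phase("majority", 50, 72, 0.05, 0.25),
    Phase("full",    100,  0, 0.05, 0.25),
]

def may_advance(phase: Phase, hours_in_phase: float,
                error_delta: float, latency_delta: float) -> bool:
    # Advance only after the dwell time has passed and no abort threshold is hit.
    return (hours_in_phase >= phase.min_dwell_hours
            and error_delta <= phase.max_error_delta
            and latency_delta <= phase.max_latency_delta)
```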
Kill Switch Criteria
Immediate rollback triggers (automated):
| Metric | Threshold | Why |
|---|---|---|
| Error rate increase | > 5% above baseline | User impact |
| p99 latency increase | > 50% above baseline | Performance degradation |
| Critical errors (5xx) | > 0.1% of requests | Service failure |
| Business metric drop | > 10% (conversion, revenue) | Revenue impact |
Warning triggers (manual investigation):
| Metric | Threshold | Action |
|---|---|---|
| Error rate increase | 2-5% above baseline | Halt rollout, investigate |
| p95 latency increase | 25-50% above baseline | Monitor closely |
| User complaints | >3 similar reports | Halt rollout, investigate |
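A sketch of how the immediate-rollback thresholds above could be evaluated automatically. The metric names and dict interface are placeholders, and "+5% above baseline" is interpreted as a relative increase; adjust if your team means percentage points:

```python
def should_rollback(current: dict, baseline: dict) -> bool:
    """Return True if any immediate-rollback threshold is breached."""
    return (
        current["error_rate"] > baseline["error_rate"] * 1.05    # error rate > +5% vs baseline
        or current["p99_ms"] > baseline["p99_ms"] * 1.5          # p99 latency > +50% vs baseline
        or current["rate_5xx"] > 0.001                           # 5xx on > 0.1% of requests
        or current["conversion"] < baseline["conversion"] * 0.9  # business metric down > 10%
    )
```

On a True result, automation should flip the flag off or shift traffic back to the old version first, then page a human to investigate.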
Statistical validation:
- Plan sample size for 95% confidence and 80% power
- Minimum 1000 samples per variant for most A/B tests
- For low-traffic features: wait 24-48 hours regardless
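To derive a sample size rather than defaulting to 1000, a standard two-proportion power calculation works. This sketch uses statsmodels; the 2.0% vs. 2.5% error rates are example numbers, not recommendations:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Smallest difference you care about: 2.0% baseline vs 2.5% canary error rate
effect = proportion_effectsize(0.020, 0.025)

# Samples needed per variant for alpha=0.05 (95% confidence) and 80% power
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.8, ratio=1.0
)
print(round(n_per_variant))  # several thousand per variant for these example rates
```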
Monitoring Quick Reference
Required metrics (all tests):
| Category | Metrics | Alert Threshold |
|---|---|---|
| Errors | Error rate, exception count, 5xx responses | > +5% vs baseline |
| Performance | p50/p95/p99 latency, request duration | p99 > +50% vs baseline |
| Business | Conversion rate, transaction completion, revenue | > -10% vs baseline |
| User Experience | Client errors, page load, bounce rate | > +20% vs baseline |
Baseline calculation:
```python
import numpy as np

# Collect baseline from previous 7-14 days
baseline_p99 = np.percentile(historical_latencies, 99)
current_p99 = np.percentile(current_latencies, 99)

if current_p99 > baseline_p99 * 1.5:  # 50% increase
    rollback()
```
Implementation Patterns
Feature Flags Pattern
```python
# Using the LaunchDarkly Python SDK (Split.io, Flagsmith, etc. follow the same pattern)
import ldclient
from ldclient import Context
from ldclient.config import Config

ldclient.set_config(Config("sdk-key"))
client = ldclient.get()

def handle_request(user_id):
    context = Context.builder(user_id).kind("user").build()
    if client.variation("new-checkout", context, False):
        return new_checkout_flow(user_id)
    else:
        return old_checkout_flow(user_id)
```
Best practices:
- Default to `False` (old behavior) for safety
- Pass user context for targeting
- Log flag evaluations for debugging
- Delete flags within 30 days of full rollout
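Because "log flag evaluations" is the practice most often skipped, a thin wrapper can make every decision traceable. A sketch reusing the `client` and `Context` from the pattern above; the log format is arbitrary:

```python
import logging

logger = logging.getLogger("feature_flags")

def evaluate_flag(flag_key: str, context, default: bool = False) -> bool:
    # Log every evaluation so "why did this user see the new behavior?"
    # can be answered from logs later.
    value = client.variation(flag_key, context, default)
    logger.info("flag=%s context=%s value=%s", flag_key, context.key, value)
    return value
```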
Canary Deployment Pattern
```yaml
# Kubernetes with Istio
# (assumes a DestinationRule that defines subsets v1 and v2 for my-service)
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: my-service
spec:
  hosts:
    - my-service
  http:
    - match:
        - headers:
            x-canary:
              exact: "true"
      route:
        - destination:
            host: my-service
            subset: v2
    - route:
        - destination:
            host: my-service
            subset: v1
          weight: 95
        - destination:
            host: my-service
            subset: v2
          weight: 5
```
Shadow Traffic Pattern
```python
# Duplicate requests to new service, ignore responses
import asyncio

async def handle_request(request):
    # Primary: serve user from old service
    response = await old_service(request)
    # Shadow: send to new service, don't wait
    asyncio.create_task(new_service(request.copy()))  # Fire and forget
    return response  # User sees old service response
```
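If the goal is performance comparison, the fire-and-forget task can also time the shadow call and record the outcome. A sketch that swallows shadow failures so they can never reach the user (here `new_service` is the same placeholder as above):

```python
import asyncio
import logging
import time

logger = logging.getLogger("shadow")

async def shadow_call(request):
    # Time the new service and record the result; never let failures propagate.
    start = time.monotonic()
    try:
        await new_service(request)
        logger.info("shadow ok latency_ms=%.1f", (time.monotonic() - start) * 1000)
    except Exception:
        logger.exception("shadow request failed (user unaffected)")
```

The primary handler would then create the task with `asyncio.create_task(shadow_call(request.copy()))` instead of calling `new_service` directly.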
Tool Ecosystem Quick Reference
| Tool Category | Options | When to Use |
|---|---|---|
| Feature Flags | LaunchDarkly, Split.io, Flagsmith, Unleash | User-level targeting, instant rollback |
| Canary/Blue-Green | Istio, Linkerd, AWS App Mesh, Flagger | Service mesh, traffic shifting |
| A/B Testing | Optimizely, VWO, Google Optimize | Business metric validation |
| Observability | DataDog, New Relic, Honeycomb, Grafana | Metrics, traces, logs correlation |
| Statistical Analysis | Statsig, Eppo, GrowthBook | Automated significance testing |
Recommendation for starting: Feature flags (Flagsmith for self-hosted, LaunchDarkly for SaaS) + existing observability
Your First Production Test
Goal: Safely test a small feature with feature flags
Week 1: Setup
Choose feature flag tool
- Self-hosted: Flagsmith (free, open source)
- SaaS: LaunchDarkly (free tier: 1000 MAU)
Instrument code
```python
if feature_flags.enabled("my-first-test", user_id):
    return new_feature(user_id)
else:
    return old_feature(user_id)
```
Set up monitoring
- Error rate dashboard
- Latency percentiles (p50, p95, p99)
- Business metric (conversion, completion rate)
Define rollback criteria
- Error rate > +5%
- p99 latency > +50%
- Business metric < -10%
Week 2: Test Execution
Day 1-2: Internal users (10 people)
- Enable flag for 10 employee user IDs
- Monitor for errors, gather feedback
Day 3-5: Canary (1% of users)
- Enable for 1% random sample
- Monitor metrics every hour
- Rollback if any threshold exceeded
Day 6-8: Small rollout (5%)
- If canary successful, increase to 5%
- Continue monitoring
Day 9-14: Full rollout (100%)
- Gradual increase: 25% → 50% → 100%
- Monitor for 7 days at 100%
Week 3: Cleanup
- Remove flag from code
- Archive flag in dashboard
- Document learnings
Common Mistakes
❌ Expanding Rollout Too Fast
Fix: Follow minimum dwell times (24 hours per phase)
❌ Monitoring Only After Issues
Fix: Dashboard ready before first rollout, alerts configured
❌ No Rollback Practice
Fix: Test rollback in staging before production
❌ Ignoring Business Metrics
Fix: Technical metrics AND business metrics required for go/no-go decisions
Quick Reference
Technique Selection:
- User-specific: Feature flags
- Deployment safety: Canary
- Performance comparison: Shadow traffic
- Business validation: A/B testing
Blast Radius Progression: Internal → 1% → 5% → 25% → 50% → 100%
Kill Switch Thresholds:
- Error rate: > +5%
- p99 latency: > +50%
- Business metrics: > -10%
Minimum Sample Sizes:
- A/B test: 1000 samples per variant
- Canary: 24 hours observation
Tool Recommendations:
- Feature flags: LaunchDarkly, Flagsmith
- Canary: Istio, Flagger
- Observability: DataDog, Grafana
Bottom Line
Production testing is safe with three controls: exposure limits, observability, instant rollback.
Start with feature flags, use progressive rollout (1% → 5% → 25% → 50% → 100%), monitor technical and business metrics, and always have a kill switch.