Claude Code Plugins

Community-maintained marketplace

testing-in-production

@tachyon-beep/skillpacks

Use when implementing feature flags, canary deployments, shadow traffic, A/B testing, choosing blast radius limits, defining rollback criteria, or monitoring production experiments - provides technique selection, anti-patterns, and kill switch frameworks

Install Skill

1. Download skill

2. Enable skills in Claude

   Open claude.ai/settings/capabilities and find the "Skills" section

3. Upload to Claude

   Click "Upload skill" and select the downloaded ZIP file

Note: Please review the skill's instructions to verify it before using it.

SKILL.md

name: testing-in-production
description: Use when implementing feature flags, canary deployments, shadow traffic, A/B testing, choosing blast radius limits, defining rollback criteria, or monitoring production experiments - provides technique selection, anti-patterns, and kill switch frameworks

Testing in Production

Overview

Core principle: Minimize blast radius, maximize observability, always have a kill switch.

Rule: Testing in production is safe when you control exposure and can roll back instantly.

Regulated industries (healthcare, finance, government): Production testing is still possible but requires additional controls - compliance review before experiments, audit trails for flag changes, avoiding PHI/PII in logs, Business Associate Agreements for third-party tools, and restricted techniques (shadow traffic may create prohibited data copies). Consult compliance team before first production test.

Technique Selection Decision Tree

| Your Goal | Risk Tolerance | Infrastructure Needed | Use |
|---|---|---|---|
| Test feature with specific users | Low | Feature flag service | Feature Flags |
| Validate deployment safety | Medium | Load balancer, multiple instances | Canary Deployment |
| Compare old vs new performance | Low | Traffic duplication | Shadow Traffic |
| Measure business impact | Medium | A/B testing framework, analytics | A/B Testing |
| Test without any user impact | Lowest | Service mesh, traffic mirroring | Dark Launch |

First technique: Feature flags (lowest infrastructure requirement, highest control)

Anti-Patterns Catalog

❌ Nested Feature Flags

Symptom: Flags controlling other flags, creating combinatorial complexity

Why bad: 2^N combinations to test, impossible to validate all paths, technical debt accumulates

Fix: Maximum 1 level of flag nesting, delete flags after rollout

# ❌ Bad
if feature_flags.enabled("new_checkout"):
    if feature_flags.enabled("express_shipping"):
        if feature_flags.enabled("gift_wrap"):
            ...  # 8 possible combinations for 3 flags

# ✅ Good
if feature_flags.enabled("new_checkout_v2"):  # Single flag for full feature
    return new_checkout_with_all_options()

❌ Canary Without Sticky Routing

Symptom: Users switch between old and new versions across requests because routing is decided per request instead of per user

Why bad: Inconsistent experience, state corruption, false negative metrics

Fix: Route user to same version for entire session

# ✅ Good - Consistent routing (nginx upstream config)
upstream backend {
    hash $cookie_user_id consistent;  # Sticky by user ID
    server backend-v1:8080 weight=95;
    server backend-v2:8080 weight=5;
}

❌ No Statistical Validation

Symptom: Making rollout decisions on small sample sizes without confidence intervals

Why bad: Random variance mistaken for real effects, premature rollback or expansion

Fix: Minimum sample size, statistical significance testing

# ✅ Good - Statistical validation
# Note: the two-proportion z-test lives in statsmodels, not scipy
from statsmodels.stats.proportion import proportions_ztest

def is_safe_to_rollout(control_errors, treatment_errors, min_sample=1000):
    # control_errors / treatment_errors: arrays of 0/1 error flags per request
    if len(treatment_errors) < min_sample:
        return False, "Insufficient data"

    # Two-proportion z-test: is the treatment error rate significantly different?
    _, p_value = proportions_ztest(
        [control_errors.sum(), treatment_errors.sum()],
        [len(control_errors), len(treatment_errors)]
    )

    # No statistically significant difference detected -> safe to continue
    return p_value > 0.05, f"p-value: {p_value}"

❌ Testing Without Rollback

Symptom: Deploying feature flags or canaries without instant kill switch

Why bad: When issues detected, can't stop impact immediately

Fix: Kill switch tested before first production test
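
A kill switch can be as small as one globally cached flag checked on every request. Below is a minimal sketch, assuming the same placeholder names used elsewhere in this skill (flag_client, old_feature, new_feature are illustrative, not a specific vendor API); the property that matters is that flipping one value stops all new-path exposure without a deploy.

# Kill switch sketch - a single global flag gating the whole experiment
# (flag_client, old_feature, new_feature are placeholders; map them to your real flag tool)
import time

class KillSwitch:
    def __init__(self, flag_client, flag_key="experiment-kill-switch", cache_ttl=5):
        self.flag_client = flag_client
        self.flag_key = flag_key
        self.cache_ttl = cache_ttl      # Re-check the flag every few seconds
        self._checked_at = 0.0
        self._killed = False

    def killed(self):
        if time.time() - self._checked_at > self.cache_ttl:
            # Fail closed: if the flag service is unreachable, behave as if killed
            self._killed = self.flag_client.enabled(self.flag_key, default=True)
            self._checked_at = time.time()
        return self._killed

def handle_request(user_id, kill_switch, feature_flags):
    if kill_switch.killed():
        return old_feature(user_id)     # Instant, deploy-free rollback path
    if feature_flags.enabled("my-first-test", user_id):
        return new_feature(user_id)
    return old_feature(user_id)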


❌ Insufficient Monitoring

Symptom: Monitoring only error rates, missing business/user metrics

Why bad: Technical success but business failure (e.g., lower conversion)

Fix: Monitor technical + business + user experience metrics

Blast Radius Control Framework

Progressive rollout schedule:

| Phase | Exposure | Duration | Abort If | Continue If |
|---|---|---|---|---|
| 1. Internal | 10-50 internal users | 1-2 days | Any errors | 0 errors, good UX feedback |
| 2. Canary | 1% production traffic | 4-24 hours | Error rate > +2%, latency > +10% | Metrics stable |
| 3. Small | 5% production | 1-2 days | Error rate > +5%, latency > +25% | Metrics stable or improved |
| 4. Medium | 25% production | 2-3 days | Error rate > +5%, latency > +25% | Metrics stable or improved |
| 5. Majority | 50% production | 3-7 days | Error rate > +5%, business metrics down | Metrics improved |
| 6. Full | 100% production | Monitor indefinitely | Business metrics drop | Cleanup old code |

Minimum dwell time: Each phase needs minimum observation period to catch delayed issues

Rollback at any phase: If metrics degrade, revert to previous phase
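
To keep rollout automation and the table above in sync, the schedule can be encoded as data and phase advancement gated on it. A sketch with dwell times and thresholds lifted from the table; the field names are illustrative, not from any particular tool.

# Progressive rollout schedule encoded as data (illustrative field names)
from dataclasses import dataclass

@dataclass
class RolloutPhase:
    name: str
    traffic_percent: float
    min_dwell_hours: int         # Minimum observation time before advancing
    max_error_rate_delta: float  # Abort if error rate exceeds baseline by this much
    max_latency_delta: float     # Abort if p99 latency exceeds baseline by this much

SCHEDULE = [
    RolloutPhase("internal",   0,  24, 0.00, 0.00),  # internal users only, zero tolerance
    RolloutPhase("canary",     1,   4, 0.02, 0.10),
    RolloutPhase("small",      5,  24, 0.05, 0.25),
    RolloutPhase("medium",    25,  48, 0.05, 0.25),
    RolloutPhase("majority",  50,  72, 0.05, 0.25),
    RolloutPhase("full",     100, 168, 0.05, 0.25),  # "monitor indefinitely"; 168h is a floor
]

def may_advance(phase, hours_observed, error_delta, latency_delta):
    # Advance only after the minimum dwell time with both deltas inside limits
    return (hours_observed >= phase.min_dwell_hours
            and error_delta <= phase.max_error_rate_delta
            and latency_delta <= phase.max_latency_delta)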

Kill Switch Criteria

Immediate rollback triggers (automated):

| Metric | Threshold | Why |
|---|---|---|
| Error rate increase | > 5% above baseline | User impact |
| p99 latency increase | > 50% above baseline | Performance degradation |
| Critical errors (5xx) | > 0.1% of requests | Service failure |
| Business metric drop | > 10% (conversion, revenue) | Revenue impact |
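
These automated triggers are easiest to keep honest as one pure function that a monitoring job calls every evaluation interval. A sketch assuming the deltas have already been computed against baseline; the metric keys and the rollback/paging hooks are placeholders.

# Automated kill-switch evaluation (placeholder metric keys and hooks)
def should_auto_rollback(metrics):
    # metrics: dict of deltas vs baseline, e.g. {"error_rate_delta": 0.02, ...}
    triggers = [
        metrics["error_rate_delta"] > 0.05,        # Error rate > +5%
        metrics["p99_latency_delta"] > 0.50,       # p99 latency > +50%
        metrics["critical_error_ratio"] > 0.001,   # 5xx > 0.1% of requests
        metrics["business_metric_delta"] < -0.10,  # Conversion/revenue down > 10%
    ]
    return any(triggers)

def evaluate(metrics, rollback, page_oncall):
    if should_auto_rollback(metrics):
        rollback()      # Flip the kill switch first...
        page_oncall()   # ...then get a human involved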

Warning triggers (manual investigation):

| Metric | Threshold | Action |
|---|---|---|
| Error rate increase | 2-5% above baseline | Halt rollout, investigate |
| p95 latency increase | 25-50% above baseline | Monitor closely |
| User complaints | > 3 similar reports | Halt rollout, investigate |

Statistical validation:

# Sample size for 95% confidence, 80% power
# Minimum 1000 samples per variant for most A/B tests
# For low-traffic features: wait 24-48 hours regardless
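
If you want an exact number instead of the 1000-sample rule of thumb, statsmodels can solve for the required sample size given the smallest effect worth detecting. A sketch assuming a 2% baseline rate and a one-percentage-point change as the minimum detectable effect; substitute your own rates.

# Required sample size per variant (95% confidence, 80% power)
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

baseline_rate = 0.02     # e.g. 2% baseline error/conversion rate (assumption)
detectable_rate = 0.03   # smallest change worth detecting: +1 percentage point

effect = proportion_effectsize(detectable_rate, baseline_rate)
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.80, ratio=1.0
)
print(f"Need ~{int(n_per_variant)} samples per variant")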

Monitoring Quick Reference

Required metrics (all tests):

| Category | Metrics | Alert Threshold |
|---|---|---|
| Errors | Error rate, exception count, 5xx responses | > +5% vs baseline |
| Performance | p50/p95/p99 latency, request duration | p99 > +50% vs baseline |
| Business | Conversion rate, transaction completion, revenue | > -10% vs baseline |
| User Experience | Client errors, page load, bounce rate | > +20% vs baseline |

Baseline calculation:

# Collect baseline from the previous 7-14 days of latency samples
import numpy as np

baseline_p99 = np.percentile(historical_latencies, 99)  # e.g. pulled from your metrics store
current_p99 = np.percentile(current_latencies, 99)      # latencies observed during the test

if current_p99 > baseline_p99 * 1.5:  # 50% increase over baseline
    rollback()  # Flip the kill switch

Implementation Patterns

Feature Flags Pattern

# Using LaunchDarkly, Split.io, or similar (LaunchDarkly's Python package is ldclient)
import ldclient
from ldclient import Context
from ldclient.config import Config

ldclient.set_config(Config("sdk-key"))
client = ldclient.get()

def handle_request(user_id):
    context = Context.builder(user_id).build()

    if client.variation("new-checkout", context, default=False):
        return new_checkout_flow(user_id)
    else:
        return old_checkout_flow(user_id)

Best practices:

  • Default to False (old behavior) for safety
  • Pass user context for targeting
  • Log flag evaluations for debugging
  • Delete flags within 30 days of full rollout
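
To support the "log flag evaluations" practice above, a thin wrapper around the client is usually enough. A sketch using the standard library logger; the wrapper name and log fields are illustrative, and it assumes the context object exposes a key attribute as LaunchDarkly-style contexts do.

# Log every flag evaluation for debugging and audit (illustrative wrapper)
import logging

logger = logging.getLogger("feature_flags")

def evaluate_flag(client, flag_key, context, default=False):
    value = client.variation(flag_key, context, default)
    # Record who saw which variation of which flag
    logger.info("flag_eval flag=%s context=%s value=%s", flag_key, context.key, value)
    return value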

Canary Deployment Pattern

# Kubernetes with Istio (assumes a DestinationRule that defines subsets v1 and v2)
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: my-service
spec:
  hosts:
  - my-service
  http:
  - match:
    - headers:
        x-canary:
          exact: "true"
    route:
    - destination:
        host: my-service
        subset: v2
  - route:
    - destination:
        host: my-service
        subset: v1
      weight: 95
    - destination:
        host: my-service
        subset: v2
      weight: 5
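
When the split has to move on a schedule (1% → 5% → 25% → ...), the weights can be patched from code rather than edited by hand. A sketch using the kubernetes Python client, assuming the VirtualService above exists and the caller has permission to patch it; names and namespace are illustrative.

# Shift the canary weight from code (sketch; assumes the VirtualService above
# and a kubeconfig that is allowed to patch it)
from kubernetes import client, config

def set_canary_weight(percent, name="my-service", namespace="default"):
    config.load_kube_config()
    api = client.CustomObjectsApi()
    # A merge patch replaces spec.http wholesale, so send the complete route list.
    # (The x-canary header route is omitted for brevity; add it back if you rely on it.)
    patch = {"spec": {"http": [
        {"route": [
            {"destination": {"host": name, "subset": "v1"}, "weight": 100 - percent},
            {"destination": {"host": name, "subset": "v2"}, "weight": percent},
        ]}
    ]}}
    api.patch_namespaced_custom_object(
        group="networking.istio.io", version="v1alpha3",
        namespace=namespace, plural="virtualservices", name=name, body=patch,
    )

set_canary_weight(5)  # 95/5 split, matching the YAML above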

Shadow Traffic Pattern

# Duplicate requests to the new service, ignore responses
import asyncio

async def shadow(request):
    try:
        await new_service(request)  # Response is discarded
    except Exception:
        pass  # In practice, record the failure for comparison, but never surface it to the user

async def handle_request(request):
    # Primary: serve user from old service
    response = await old_service(request)

    # Shadow: send a copy to the new service, don't wait for it
    asyncio.create_task(shadow(request.copy()))  # Fire and forget

    return response  # User sees old service response

Tool Ecosystem Quick Reference

| Tool Category | Options | When to Use |
|---|---|---|
| Feature Flags | LaunchDarkly, Split.io, Flagsmith, Unleash | User-level targeting, instant rollback |
| Canary/Blue-Green | Istio, Linkerd, AWS App Mesh, Flagger | Service mesh, traffic shifting |
| A/B Testing | Optimizely, VWO, Google Optimize | Business metric validation |
| Observability | DataDog, New Relic, Honeycomb, Grafana | Metrics, traces, logs correlation |
| Statistical Analysis | Statsig, Eppo, GrowthBook | Automated significance testing |

Recommendation for starting: Feature flags (Flagsmith for self-hosted, LaunchDarkly for SaaS) + existing observability

Your First Production Test

Goal: Safely test a small feature with feature flags

Week 1: Setup

  1. Choose feature flag tool

    • Self-hosted: Flagsmith (free, open source)
    • SaaS: LaunchDarkly (free tier: 1000 MAU)
  2. Instrument code

    if feature_flags.enabled("my-first-test", user_id):
        return new_feature(user_id)
    else:
        return old_feature(user_id)
    
  3. Set up monitoring

    • Error rate dashboard
    • Latency percentiles (p50, p95, p99)
    • Business metric (conversion, completion rate)
  4. Define rollback criteria

    • Error rate > +5%
    • p99 latency > +50%
    • Business metric < -10%

Week 2: Test Execution

Day 1-2: Internal users (10 people)

  • Enable flag for 10 employee user IDs
  • Monitor for errors, gather feedback

Day 3-5: Canary (1% of users)

  • Enable for 1% random sample
  • Monitor metrics every hour
  • Rollback if any threshold exceeded

Day 6-8: Small rollout (5%)

  • If canary successful, increase to 5%
  • Continue monitoring

Day 9-14: Full rollout (100%)

  • Gradual increase: 25% → 50% → 100%
  • Monitor for 7 days at 100%

Week 3: Cleanup

  • Remove flag from code
  • Archive flag in dashboard
  • Document learnings

Common Mistakes

❌ Expanding Rollout Too Fast

Fix: Follow minimum dwell times (24 hours per phase)


❌ Monitoring Only After Issues

Fix: Dashboard ready before first rollout, alerts configured


❌ No Rollback Practice

Fix: Test rollback in staging before production


❌ Ignoring Business Metrics

Fix: Technical metrics AND business metrics required for go/no-go decisions

Quick Reference

Technique Selection:

  • User-specific: Feature flags
  • Deployment safety: Canary
  • Performance comparison: Shadow traffic
  • Business validation: A/B testing

Blast Radius Progression: Internal → 1% → 5% → 25% → 50% → 100%

Kill Switch Thresholds:

  • Error rate: > +5%
  • p99 latency: > +50%
  • Business metrics: > -10%

Minimum Sample Sizes:

  • A/B test: 1000 samples per variant
  • Canary: 24 hours observation

Tool Recommendations:

  • Feature flags: LaunchDarkly, Flagsmith
  • Canary: Istio, Flagger
  • Observability: DataDog, Grafana

Bottom Line

Production testing is safe with three controls: exposure limits, observability, instant rollback.

Start with feature flags, use progressive rollout (1% → 5% → 25% → 50% → 100%), monitor technical + business metrics, and always have a kill switch.