---
name: chaos-engineering-principles
description: Use when starting chaos engineering, designing fault injection experiments, choosing chaos tools, testing system resilience, or recovering from chaos incidents - provides hypothesis-driven testing, blast radius control, and anti-patterns for safe chaos
---
# Chaos Engineering Principles

## Overview

**Core principle:** Chaos engineering validates resilience through controlled experiments, not random destruction.

**Rule:** Start in staging, with monitoring, with rollback, with a small blast radius. No exceptions.
## When NOT to Do Chaos

Don't run chaos experiments if ANY of the following applies:
- ❌ No comprehensive monitoring (APM, metrics, logs, alerts)
- ❌ No automated rollback capability
- ❌ No baseline metrics documented
- ❌ No incident response team available
- ❌ System already unstable (fix stability first)
- ❌ No staging environment to practice
Fix these prerequisites BEFORE chaos testing.
## Tool Selection Decision Tree
| Your Constraint | Choose | Why |
|---|---|---|
| Kubernetes-native, CNCF preference | LitmusChaos | Cloud-native, operator-based, excellent K8s integration |
| Kubernetes-focused, visualization needs | Chaos Mesh | Fine-grained control, dashboards, low overhead |
| Want managed service, quick start | Gremlin | Commercial, guided experiments, built-in best practices |
| Vendor-neutral, maximum flexibility | Chaos Toolkit | Open source, plugin ecosystem, any infrastructure |
| AWS-specific, cost-sensitive | AWS FIS | Native AWS integration, pay-per-experiment |
**For most teams:** Chaos Toolkit (flexible, free) or Gremlin (fast, managed).
## Prerequisites Checklist

Before the FIRST experiment:

**Monitoring (Required):**
- Real-time dashboards for key metrics (latency, error rate, throughput)
- Distributed tracing for request flows
- Log aggregation with timeline correlation
- Alerts configured with thresholds
**Rollback (Required):**
- Automated rollback based on metrics (e.g., error rate > 5% → abort); a Prometheus alert sketch follows this checklist
- Manual kill switch everyone can activate
- Rollback tested and documented (< 30 sec recovery)
**Baseline (Required):**
- Documented normal metrics (P50/P95/P99 latency, error rate %)
- Known dependencies and critical paths
- System architecture diagram
**Team (Required):**
- Designated observer monitoring experiment
- On-call engineer available
- Communication channel established (war room, Slack)
- Post-experiment debrief scheduled
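
One way to make "alerts with thresholds" double as automated abort criteria: encode the abort threshold as an alerting rule and wire the alert to whatever pauses or deletes the chaos resource. A minimal Prometheus sketch; the metric name, labels, and threshold are assumptions to adapt to your stack:

```yaml
# Fire when the 5xx error rate crosses the 5% abort threshold.
# http_requests_total and its status label are assumptions.
groups:
  - name: chaos-abort-criteria
    rules:
      - alert: ChaosExperimentAbort
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[1m]))
            / sum(rate(http_requests_total[1m])) > 0.05
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% - abort the chaos experiment"
```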
## Anti-Patterns Catalog

### ❌ Production First Chaos

**Symptom:** "Let's start chaos testing in production to see what breaks."

**Why bad:** No practice, no muscle memory, and production incidents are guaranteed.

**Fix:** Run 5-10 experiments in staging FIRST. Graduate to production only after proving that experiments work as designed, rollback functions, and the team can execute its response.
### ❌ Chaos Without Monitoring

**Symptom:** "We injected latency but we're not sure what happened."

**Why bad:** Blind chaos = no learning. You can't validate resilience without seeing system behavior.

**Fix:** Set up comprehensive monitoring BEFORE the first experiment. You must be able to answer "What changed?" within 30 seconds.
### ❌ Unlimited Blast Radius

**Symptom:** Affecting 100% of traffic or all services on the first run.

**Why bad:** Cascading failures, actual outages, customer impact.

**Fix:** Start at 0.1-1% of traffic. Progression: 0.1% → 1% → 5% → 10% → (stop, or 50%). Validate each step before expanding.
### ❌ Chaos Without Rollback

**Symptom:** "The experiment broke everything and we can't stop it."

**Why bad:** The chaos becomes a real incident: 2+ hour recovery, lost trust.

**Fix:** Define automated abort criteria (error-rate threshold, latency threshold, manual kill switch) and test the rollback before injecting failures. A sketch follows.
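
What a tested rollback can look like in experiment-as-code: Chaos Toolkit experiments declare a `rollbacks` section that runs at the end of a run (the rollback strategy is configurable on `chaos run`). A minimal fragment; the cleanup script path is a hypothetical stand-in:

```yaml
# Declared rollbacks give you a repeatable, testable undo path.
# The script path is hypothetical - point it at your own cleanup.
rollbacks:
  - type: action
    name: remove-injected-latency
    provider:
      type: process
      path: ./scripts/remove-latency.sh
```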
### ❌ Random Chaos (No Hypothesis)

**Symptom:** "Let's inject some failures and see what happens."

**Why bad:** No learning objective, no way to validate resilience, wasted time.

**Fix:** Every experiment needs a hypothesis: "System will [expected behavior] when [failure injected]."
## Failure Types Catalog

Priority order for microservices:
| Failure Type | Priority | Why Test This | Example |
|---|---|---|---|
| Network Latency | HIGH | Most common production issue | 500ms delay service A → B |
| Service Timeout | HIGH | Tests circuit breakers, retry logic | Service B unresponsive |
| Connection Loss | HIGH | Tests failover, graceful degradation | TCP connection drops |
| Resource Exhaustion | MEDIUM | Tests resource limits, scaling | Memory limit, connection pool full |
| Packet Loss | MEDIUM | Tests retry strategies | 1-10% packet loss |
| DNS Failure | MEDIUM | Tests service discovery resilience | DNS resolution delays |
| Cache Failure | MEDIUM | Tests fallback behavior | Redis down |
| Database Errors | LOW (start) | High risk - test after basics work | Connection refused, query timeout |
Start with network latency: it is the safest, the most informative, and the easiest to roll back. A Kubernetes sketch follows.
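
For Kubernetes users, a minimal latency injection using a Chaos Mesh `NetworkChaos` resource (one of the tools from the decision tree); the namespace, labels, and duration are assumptions:

```yaml
# Inject 500ms latency into one pod of the target service for 5 minutes.
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: recommendations-latency
  namespace: staging
spec:
  action: delay
  mode: one                  # affect a single matching pod
  selector:
    namespaces:
      - staging
    labelSelectors:
      app: recommendations   # hypothetical service label
  delay:
    latency: "500ms"
  duration: "5m"
```

Deleting the resource stops the injection; annotating it with `experiment.chaos-mesh.org/pause: "true"` pauses it, which doubles as a kill switch.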
## Experiment Template

Use this for every chaos experiment:

**1. Hypothesis**

"If [failure injected], system will [expected behavior], and [metric] will remain [threshold]."

Example: "If service-payment experiences 2s latency, the circuit breaker will open within 10s, and P99 latency will stay < 500ms."
**2. Baseline Metrics**
- Current P50/P95/P99 latency:
- Current error rate:
- Current throughput:
**3. Experiment Config**
- Failure type: [latency / packet loss / service down / etc.]
- Target: [specific service / % of traffic]
- Blast radius: [0.1% traffic, single region, canary pods]
- Duration: [2-5 minutes initial]
- Abort criteria: [error rate > 5% OR P99 > 1s OR manual stop]
**4. Execution**
- Observer: [name] monitoring dashboards
- Runner: [name] executing experiment
- Kill switch: [procedure]
- Start time: [timestamp]
**5. Observation**
- What happened vs hypothesis:
- Actual metrics during chaos:
- System behavior notes:
**6. Validation**
- ✓ Hypothesis validated / ✗ Hypothesis failed
- Unexpected findings:
- Action items:
## Blast Radius Progression

Safe scaling path:
| Step | Traffic Affected | Duration | Abort If |
|---|---|---|---|
| 1. Staging | 100% staging | 5 min | Any production impact |
| 2. Canary | 0.1% production | 2 min | Error rate > 1% |
| 3. Small | 1% production | 5 min | Error rate > 2% |
| 4. Medium | 5% production | 10 min | Error rate > 5% |
| 5. Large | 10% production | 15 min | Error rate > 5% |
Never skip steps. Validate each step before expanding; a traffic-percentage sketch follows.

Stop at 10-20% for most experiments - there is no need to chaos-test 100% of production traffic.

Low-traffic services (< 1000 req/day): use absolute request counts instead of percentages, with a minimum of 5-10 affected requests per step. Example: a 100 req/day service should still expose 5-10 requests per step (roughly 1-2.5 hours of traffic), not 0.1% (1 request every 10 days).
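
Scoping by traffic percentage requires a tool that can split requests. As one illustration, assuming an Istio service mesh (the host names are stand-ins; other meshes and the tools above have equivalents), a fault that delays exactly 1% of requests, matching step 3:

```yaml
# Istio VirtualService fault injection: delay 1% of requests by 500ms.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: recommendations-chaos
spec:
  hosts:
    - recommendations        # hypothetical service host
  http:
    - fault:
        delay:
          percentage:
            value: 1.0       # percent of requests affected
          fixedDelay: 500ms
      route:
        - destination:
            host: recommendations
```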
## Your First Experiment (Staging)

**Goal:** Build confidence, validate monitoring, test rollback.

**Experiment:** Network latency on a non-critical service.

1. Pick the least critical service (e.g., recommendation engine, not payment).
2. Inject 500ms latency into 100% of staging traffic.
3. Duration: 5 minutes.
4. Expected: timeouts handled gracefully, fallback behavior activates.
5. Monitor: error rate, latency, downstream services.
6. Abort if: error rate > 10% or cascading failures.
7. Debrief: What did we learn? Did monitoring catch it? Did rollback work?

A Chaos Toolkit sketch of this experiment is below.
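
A minimal sketch as a Chaos Toolkit experiment file, run with `chaos run first-experiment.yaml`. The service name, health URL, and injection scripts are hypothetical stand-ins (a real injection would use an extension such as chaostoolkit-kubernetes, or a mesh fault like the one above):

```yaml
# first-experiment.yaml - hypothesis, injection, and rollback in one file.
version: 1.0.0
title: 500ms latency on the recommendation service (staging)
description: >
  With 500ms of injected latency, timeouts are handled gracefully,
  fallbacks activate, and the error rate stays under 10%.

# Checked before the method runs and again after it completes; a probe
# out of tolerance marks the experiment as deviated.
steady-state-hypothesis:
  title: Recommendations respond normally
  probes:
    - type: probe
      name: recommendations-health-ok
      tolerance: 200                      # expected HTTP status code
      provider:
        type: http
        timeout: 3
        url: http://recommendations.staging.internal/health

method:
  - type: action
    name: inject-500ms-latency
    provider:
      type: process
      path: ./scripts/inject-latency.sh   # hypothetical injection script
      arguments: "--service recommendations --delay 500ms"
    pauses:
      after: 300                          # hold the fault for 5 minutes

rollbacks:
  - type: action
    name: remove-injected-latency
    provider:
      type: process
      path: ./scripts/remove-latency.sh   # hypothetical cleanup script
```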
**Success criteria:** You can answer "Did our hypothesis hold?" within 5 minutes of experiment completion.
## Common Mistakes

### ❌ Testing During Incidents

**Fix:** Only run chaos tests during stable periods, during business hours, and with extra staffing.

### ❌ Underestimating Network Latency

**Fix:** Latency cascades: 500ms can become 5s downstream. Start with 100-200ms, observe, then increase.

### ❌ No Post-Experiment Review

**Fix:** Every experiment gets a 15-minute debrief: What worked? What broke? What did we learn?
## Quick Reference

**Prerequisites before first chaos:**
- Monitoring + alerts
- Automated rollback
- Baseline metrics documented
- Team coordinated
**Experiment steps:**
- Write hypothesis
- Document baseline
- Define blast radius (start 0.1%)
- Set abort criteria
- Execute with observer
- Validate hypothesis
- Debrief team
**Blast radius progression:** Staging → 0.1% → 1% → 5% → 10% (stop for most experiments)

**First experiment:** Network latency (500ms) on a non-critical service in staging for 5 minutes
## Bottom Line
Chaos engineering is hypothesis-driven science, not random destruction.
Start small (staging, 0.1% traffic), with monitoring, with rollback. Graduate slowly.