| name | chaos-engineering |
| description | Test system resilience through controlled failures. Use when validating fault tolerance, disaster recovery, or system reliability. Covers chaos experiments. |
| allowed-tools | Read, Write, Bash, Glob, Grep |
Chaos Engineering
Principles
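- Build a Hypothesis: State the expected steady-state behavior before injecting any failure
- Minimize Blast Radius: Start with the smallest scope that can still test the hypothesis
- Run in Production: Real traffic and conditions matter, but only after non-production runs have passed
- Automate: Script experiments so they are repeatable
- Minimize Impact: Define abort conditions before the experiment starts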
Experiment Process
- Steady State: Define normal metrics
- Hypothesis: "System will maintain X under condition Y"
- Introduce Variables: Inject failure
- Observe: Compare to steady state
- Analyze: Confirm or disprove hypothesis
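The five steps above can be sketched as a small bash harness. This is only an illustration under stated assumptions: `within_tolerance` and `run_experiment` are hypothetical helper names, the probe is assumed to print a single integer metric (for example, p95 latency in milliseconds), and the probe, inject, and rollback arguments are names of functions you define yourself.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and the integer-metric convention are assumptions.

# Return 0 when observed is within tolerance_pct percent of baseline.
within_tolerance() {
  local baseline="$1" observed="$2" tolerance_pct="$3"
  local diff=$(( observed - baseline ))
  if [ "$diff" -lt 0 ]; then
    diff=$(( -diff ))
  fi
  # Integer form of: diff / baseline <= tolerance_pct / 100
  [ $(( diff * 100 )) -le $(( baseline * tolerance_pct )) ]
}

# probe, inject and rollback are names of caller-defined shell functions.
run_experiment() {
  local probe="$1" inject="$2" rollback="$3" tolerance_pct="$4"
  local baseline observed
  baseline="$("$probe")"   # Steady state: record the normal metric
  "$inject"                # Introduce variables: inject the failure
  observed="$("$probe")"   # Observe: measure again under failure
  "$rollback"              # Always undo the injection before analysing
  if within_tolerance "$baseline" "$observed" "$tolerance_pct"; then
    echo "hypothesis held: baseline=$baseline observed=$observed"
  else
    echo "hypothesis disproved: baseline=$baseline observed=$observed"
    return 1
  fi
}
```

A call might look like `run_experiment measure_p95 add_latency remove_latency 10`, where the three function names are placeholders for your own probe and injection code. A real experiment would usually sample the metric several times rather than once.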
Common Experiments
Network Failures
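# Add 100 ms of latency to outgoing packets on eth0 (requires root)
tc qdisc add dev eth0 root netem delay 100ms
# Switch to 10% packet loss; "replace" is used because "add" fails while a root qdisc exists
tc qdisc replace dev eth0 root netem loss 10%
# Remove the netem qdisc and restore normal networking
tc qdisc del dev eth0 root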
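A netem rule left behind after an interrupted session keeps degrading the host, so it can help to bound the injection in time and clean up automatically. The following is a minimal sketch, assuming bash and root privileges. `inject_latency`, `run`, and the `DRY_RUN` variable are illustrative names and not part of `tc`.

```shell
#!/usr/bin/env bash
# Sketch only: function and variable names are assumptions for illustration.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Usage: inject_latency <interface> <delay> <duration-seconds>
inject_latency() {
  local iface="$1" delay="$2" duration="$3"
  (
    # The EXIT trap removes the qdisc when this subshell ends,
    # whether it finishes normally or is interrupted.
    trap 'run tc qdisc del dev "$iface" root' EXIT
    trap 'exit 1' INT TERM
    # "replace" works whether or not a root qdisc is already present.
    run tc qdisc replace dev "$iface" root netem delay "$delay" || exit 1
    run sleep "$duration"
  )
}
```

Running `DRY_RUN=1 inject_latency eth0 100ms 60` first prints the commands without touching the interface, which is a cheap way to review an experiment before executing it.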
Resource Exhaustion
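# CPU stress: 4 workers for 60 seconds
stress --cpu 4 --timeout 60s
# Memory stress: 2 workers allocating 1 GB each (2 GB total) for 60 seconds
stress --vm 2 --vm-bytes 1G --timeout 60s
# Disk fill: write a 1 GiB file of zeros (on hosts where /tmp is tmpfs this consumes memory instead)
dd if=/dev/zero of=/tmp/fill bs=1M count=1024
# Delete the file when the experiment ends
rm -f /tmp/fill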
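Unlike the `stress` commands, the disk fill has no built-in timeout, so a forgotten file keeps the disk full. The sketch below is one possible wrapper, assuming bash and GNU `dd`. `fill_disk` and its `reserve_mb` guard are hypothetical, not a standard tool.

```shell
#!/usr/bin/env bash
# Sketch only: fill_disk and its parameters are assumptions for illustration.

# Usage: fill_disk <target-file> <size-mb> <hold-seconds> [reserve-mb]
fill_disk() {
  local target="$1" size_mb="$2" hold_s="$3" reserve_mb="${4:-512}"
  local avail_kb
  # Column 4 of POSIX-format df output is the available space in KB.
  avail_kb="$(df -Pk "$(dirname "$target")" | awk 'NR==2 {print $4}')"
  if [ $(( avail_kb / 1024 - size_mb )) -lt "$reserve_mb" ]; then
    echo "refusing: fill would leave less than ${reserve_mb} MB free" >&2
    return 1
  fi
  (
    # Remove the fill file when the subshell ends, even if interrupted.
    trap 'rm -f "$target"' EXIT
    trap 'exit 1' INT TERM
    dd if=/dev/zero of="$target" bs=1M count="$size_mb" 2>/dev/null
    sleep "$hold_s"
  )
}
```

For example, `fill_disk /tmp/fill 1024 60` would hold a 1 GiB file for 60 seconds and then delete it, refusing to start if that would leave under 512 MB free.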
Service Failures
- Kill processes
- Restart containers
- Terminate instances
- Block dependencies
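Each of these failures is only informative if you also check that the service comes back. A minimal sketch of that check in bash follows. `wait_for_recovery` is a hypothetical helper, and the service name and health URL in the usage comment are placeholders.

```shell
#!/usr/bin/env bash
# Sketch only: wait_for_recovery is an illustrative helper, not a standard tool.

# Usage: wait_for_recovery <timeout-seconds> <health-check command...>
# Polls the health check once per second; returns 0 on the first success
# and 1 if the check never succeeds before the timeout.
wait_for_recovery() {
  local timeout_s="$1"
  shift
  local deadline=$(( SECONDS + timeout_s ))
  while [ "$SECONDS" -le "$deadline" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep 1
  done
  return 1
}

# Example (placeholder names): kill a process, then expect recovery within 30 s.
#   pkill -9 -x my-service
#   wait_for_recovery 30 curl -fsS http://localhost:8080/health
```

The health check is passed as a command rather than a string so that arguments with spaces survive without `eval`.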
Chaos Tools
- Chaos Monkey: Netflix tool for random instance termination
- Gremlin: Commercial chaos engineering platform
- Litmus: Kubernetes chaos engineering
- Chaos Mesh: Chaos engineering platform for Kubernetes
Experiment Template
## Experiment: [Name]
### Hypothesis
If [condition], then [expected behavior].
### Steady State
- Metric A: [baseline value]
- Metric B: [baseline value]
### Method
1. [Step 1]
2. [Step 2]
3. [Step 3]
### Abort Conditions
- If [condition], stop immediately
### Results
[What happened]
### Findings
[What we learned]
Safety Rules
- Start in non-production
- Have rollback ready
- Monitor continuously
- Communicate with team
- Document everything
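The "have rollback ready" and "monitor continuously" rules can be combined into a small watchdog that runs alongside an experiment. This is a sketch under assumptions, not a complete safety system: `watch_and_abort` is a hypothetical name, and the guard and rollback arguments are names of shell functions you supply.

```shell
#!/usr/bin/env bash
# Sketch only: watch_and_abort is an illustrative helper.

# Usage: watch_and_abort <checks> <interval-seconds> <guard-fn> <rollback-fn>
# Runs the guard up to <checks> times; on the first failure it calls the
# rollback function and returns 1. Returns 0 if every check passes.
watch_and_abort() {
  local checks="$1" interval_s="$2" guard="$3" rollback="$4"
  local i
  for (( i = 0; i < checks; i++ )); do
    if ! "$guard"; then
      echo "abort condition hit on check $(( i + 1 )); rolling back" >&2
      "$rollback"
      return 1
    fi
    sleep "$interval_s"
  done
  return 0
}
```

A guard might compare the current error rate against the abort threshold from the experiment template, and the rollback might remove a `tc` rule or delete a fill file.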