| name | root-cause-analysis |
| description | Conduct systematic root cause analysis to identify underlying problems. Use structured methodologies to prevent recurring issues and drive improvements. |
Root Cause Analysis
Overview
Root cause analysis (RCA) identifies underlying reasons for failures, enabling permanent solutions rather than temporary fixes.
When to Use
- Production incidents
- Customer-impacting issues
- Repeated problems
- Unexpected failures
- Performance degradation
Instructions
1. The 5 Whys Technique
Example: Website Down
Symptom: Website returned 503 Service Unavailable
Why 1: Why was the website down?
Answer: Database connection pool exhausted
Why 2: Why was the connection pool exhausted?
Answer: Queries taking too long, connections not released
Why 3: Why were queries slow?
Answer: Missing index on frequently queried column
Why 4: Why was the index missing?
Answer: Performance testing didn't use production-like data volume
Why 5: Why wasn't production-like data used?
Answer: Load testing environment doesn't mirror production
Root Cause: Load testing environment does not mirror production (data volume too small to expose slow queries)
Solution: Update load testing environment with production-like data
Prevention: Establish environment parity requirements
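The chain above can be recorded as data so that each answer feeds the next question. The following is a minimal sketch, not a prescribed tool: the `FiveWhys` class and its method names are hypothetical, and treating the last answer as the candidate root cause is an assumption that still needs human review.

```python
from dataclasses import dataclass, field


@dataclass
class FiveWhys:
    """Records a chain of (question, answer) pairs starting from a symptom."""
    symptom: str
    steps: list[tuple[str, str]] = field(default_factory=list)

    def ask(self, question: str, answer: str) -> "FiveWhys":
        self.steps.append((question, answer))
        return self  # returning self allows the calls to be chained

    def root_cause(self) -> str:
        # Assumption: the last answer in the chain is the candidate root cause.
        if not self.steps:
            raise ValueError("no whys recorded yet")
        return self.steps[-1][1]

    def render(self) -> str:
        lines = [f"Symptom: {self.symptom}"]
        for number, (question, answer) in enumerate(self.steps, start=1):
            lines.append(f"Why {number}: {question}")
            lines.append(f"Answer: {answer}")
        lines.append(f"Candidate root cause: {self.root_cause()}")
        return "\n".join(lines)


analysis = (
    FiveWhys("Website returned 503 Service Unavailable")
    .ask("Why was the website down?", "Database connection pool exhausted")
    .ask("Why was the connection pool exhausted?",
         "Queries taking too long, connections not released")
    .ask("Why were queries slow?", "Missing index on frequently queried column")
    .ask("Why was the index missing?",
         "Performance testing didn't use production-like data volume")
    .ask("Why wasn't production-like data used?",
         "Load testing environment doesn't mirror production")
)
print(analysis.render())
```

Five is a guideline rather than a rule: stop when the answer points at a process or system that can be changed, which may take fewer or more questions.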
2. Systematic RCA Process
Step 1: Gather Facts
- When did issue occur?
- Who detected it?
- How many users affected?
- What error messages appeared?
- What system changes were deployed beforehand?
- Check logs, metrics, alerts
- Determine impact scope
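Answering "when did the issue occur?" from logs often comes down to bucketing log lines by time and looking for the first bucket with errors. The sketch below assumes a hypothetical `<ISO timestamp> <LEVEL> <message>` log format and made-up sample lines; real logs will need their own parsing.

```python
from collections import Counter

# Hypothetical sample lines in an assumed "<ISO timestamp> <LEVEL> <message>" format.
SAMPLE_LOGS = [
    "2024-01-15T14:29:58 INFO request ok",
    "2024-01-15T14:30:05 ERROR could not acquire database connection",
    "2024-01-15T14:30:41 INFO request ok",
    "2024-01-15T14:31:02 ERROR could not acquire database connection",
    "2024-01-15T14:31:30 ERROR could not acquire database connection",
]


def error_rate_by_minute(lines):
    """Return {minute: fraction of lines at ERROR level}, ordered by minute."""
    totals, errors = Counter(), Counter()
    for line in lines:
        timestamp, level, _message = line.split(" ", 2)
        minute = timestamp[:16]  # keep "YYYY-MM-DDTHH:MM"
        totals[minute] += 1
        if level == "ERROR":
            errors[minute] += 1
    return {minute: errors[minute] / totals[minute] for minute in sorted(totals)}


rates = error_rate_by_minute(SAMPLE_LOGS)
# The first minute with any errors is a starting point for the incident timeline.
first_error_minute = next(minute for minute, rate in rates.items() if rate > 0)
print(rates)
print("First errors seen at:", first_error_minute)
```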
Step 2: Reproduce
- Can we reproduce consistently?
- What are the exact steps?
- What environment (prod, staging)?
- Can we isolate it to a single component?
- Set up test case
Step 3: Identify Contributing Factors
- Direct cause
- Indirect/enabling factors
- System vulnerabilities
- Procedural gaps
- Knowledge gaps
Step 4: Determine Root Cause
- Use 5 Whys technique
- Ask "why did this control fail?"
- Look for systemic issues
- Separate root cause from symptoms
Step 5: Develop Solutions
- Immediate: Fix the symptom
- Short-term: Prevent recurrence
- Long-term: Systemic fix
- Prioritize by impact/effort
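One simple way to prioritize by impact/effort is to score each candidate solution and sort by the ratio. This is only a sketch: the `Solution` class, the 1 to 5 scales, and the scores themselves are illustrative assumptions, not measurements.

```python
from dataclasses import dataclass


@dataclass
class Solution:
    name: str
    horizon: str  # "immediate", "short-term", or "long-term"
    impact: int   # assumed scale: 1 (low) to 5 (high)
    effort: int   # assumed scale: 1 (low) to 5 (high)

    @property
    def score(self) -> float:
        # Higher impact and lower effort both raise the score.
        return self.impact / self.effort


# Illustrative scores only; a real team would agree on these together.
candidates = [
    Solution("Add query duration alerts", "short-term", impact=3, effort=2),
    Solution("Rebuild load testing with production-like data", "long-term", impact=5, effort=5),
    Solution("Add index for slow query", "short-term", impact=4, effort=1),
]

ranked = sorted(candidates, key=lambda s: s.score, reverse=True)
for solution in ranked:
    print(f"{solution.score:.1f}  {solution.horizon:<10}  {solution.name}")
```

A ratio favors quick wins, so a high-impact, high-effort systemic fix can rank last; treat the ranking as input to the discussion rather than the decision itself.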
Step 6: Implement & Verify
- Implement solutions
- Test in staging
- Deploy carefully
- Verify improvement
- Monitor metrics
Step 7: Document & Share
- Write RCA report
- Document lessons learned
- Share with team
- Update procedures
- Training if needed
3. RCA Report Template
RCA Report:
Incident: Database connection failure (2024-01-15, 14:30-15:15)
Impact:
- Duration: 45 minutes
- Users affected: 5,000 (10% of user base)
- Revenue lost: ~$2,000
- Severity: P1 (Critical)
Timeline:
14:30: Automated monitoring alert: High error rate (20%)
14:32: On-call engineer notified
14:35: Identified database connection error in logs
14:40: Restarted database connection pool
14:42: Service recovered, error rate returned to 0.1%
14:50: Incident declared resolved
15:15: Full recovery verified
Root Cause:
Poorly optimized query introduced in release 2.5.0 caused
queries to take 10x longer. Connection pool exhausted as
connections weren't released quickly.
Contributing Factors:
1. No query performance testing pre-deployment
2. Load testing environment doesn't match production volume
3. No alerting on query duration
4. Connection pool timeout set too high
Solutions:
Immediate (Done):
- Rolled back the problematic query change from release 2.5.0
Short-term (1 week):
- Add query performance alerts (>1s)
- Add index for slow query
- Set query timeout to 5 seconds
Long-term (1 month):
- Update load testing with production-like data
- Implement performance benchmarks in CI/CD
- Improve monitoring for connection pool health
- Provide training on query optimization
Prevention:
- Query performance regression tests
- Load testing with production data
- Connection pool metrics monitoring
- Code review of database changes
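The Impact and Timeline sections of the report should agree with each other, and deriving the durations from the timeline is an easy consistency check. The sketch below uses the timestamps from the template above; the dictionary keys and the `minutes_between` helper are hypothetical names, and it assumes every event falls on the same day.

```python
from datetime import datetime

# Timestamps copied from the report template; key names are illustrative.
timeline = {
    "alert": "14:30",
    "engineer_notified": "14:32",
    "cause_identified": "14:35",
    "pool_restarted": "14:40",
    "service_recovered": "14:42",
    "declared_resolved": "14:50",
    "recovery_verified": "15:15",
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes from start to end, assuming both are on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


time_to_acknowledge = minutes_between(timeline["alert"], timeline["engineer_notified"])
time_to_recover = minutes_between(timeline["alert"], timeline["service_recovered"])
total_duration = minutes_between(timeline["alert"], timeline["recovery_verified"])

print(f"Time to acknowledge: {time_to_acknowledge} min")
print(f"Time to recover:     {time_to_recover} min")
print(f"Total duration:      {total_duration} min")
```

Here the total comes out at 45 minutes, matching the Duration line in the Impact section.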
4. Root Cause Analysis Techniques
Fishbone Diagram:
Main problem: Slow API Response
Branches:
Code:
- Inefficient algorithm
- Missing cache
- Unnecessary queries
Data:
- Large dataset
- Missing index
- Slow database
Infrastructure:
- Low CPU capacity
- Slow network
- Disk I/O bottleneck
Process:
- No monitoring
- No load testing
- Manual deployments
People:
- Lack of knowledge
- Lack of tools
- No peer review
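A fishbone diagram is a two-level tree, so it can be kept as plain data alongside the RCA report and rendered on demand. The following is a minimal sketch; the dictionary layout and the `render_fishbone` function are assumptions for illustration.

```python
# The categories and causes are taken from the fishbone example above.
fishbone = {
    "problem": "Slow API Response",
    "branches": {
        "Code": ["Inefficient algorithm", "Missing cache", "Unnecessary queries"],
        "Data": ["Large dataset", "Missing index", "Slow database"],
        "Infrastructure": ["Low CPU capacity", "Slow network", "Disk I/O bottleneck"],
        "Process": ["No monitoring", "No load testing", "Manual deployments"],
        "People": ["Lack of knowledge", "Lack of tools", "No peer review"],
    },
}


def render_fishbone(diagram: dict) -> str:
    """Render the diagram as indented text, one branch per category."""
    lines = [f"Main problem: {diagram['problem']}"]
    for category, causes in diagram["branches"].items():
        lines.append(f"  {category}:")
        lines.extend(f"    - {cause}" for cause in causes)
    return "\n".join(lines)


total_causes = sum(len(causes) for causes in fishbone["branches"].values())
print(render_fishbone(fishbone))
print(f"{total_causes} candidate causes across {len(fishbone['branches'])} categories")
```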
---
Systemic vs. Individual Causes:
Individual: "Developer used inefficient code"
Fix: Training
Risk: Happens again with a different person
Systemic: "No code review process"
Fix: Implement mandatory code review
Benefit: Prevents similar issues regardless of who writes the code
Prefer systemic solutions for prevention
5. Follow-Up & Prevention
After RCA:
1. Track Action Items
- Assign owner
- Set deadline
- Follow up in retrospective
2. Prevent Recurrence
- Automated tests
- Monitoring/alerts
- Procedural changes
- Training
3. Monitor Metrics
- Track similar incidents
- Verify fix effectiveness
- Monitor preventive measures
- Catch early warnings
4. Share Learnings
- Document incident
- Share with team
- Industry sharing if relevant
- Update procedures
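Action items are only useful if each one has an owner and a deadline and someone checks on them. As a rough sketch of that tracking, the `ActionItem` class, the owner placeholders, and the dates below are all hypothetical; most teams would use their issue tracker instead.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    description: str
    owner: str
    deadline: date
    done: bool = False

    def is_overdue(self, today: date) -> bool:
        # Completed items are never overdue, whatever their deadline.
        return not self.done and today > self.deadline


# Placeholder owners and illustrative deadlines.
items = [
    ActionItem("Add query performance alerts", "owner-a", date(2024, 1, 22)),
    ActionItem("Update load testing with production-like data", "owner-b", date(2024, 2, 15)),
    ActionItem("Add index for slow query", "owner-c", date(2024, 1, 22), done=True),
]

review_date = date(2024, 1, 29)  # e.g. the date of the next retrospective
overdue = [item.description for item in items if item.is_overdue(review_date)]
open_items = [item.description for item in items if not item.done]

print("Overdue:", overdue)
print("Still open:", open_items)
```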
---
Checklist:
[ ] Incident details documented
[ ] Timeline established
[ ] Logs reviewed
[ ] Metrics analyzed
[ ] Root cause identified (via 5 Whys)
[ ] Contributing factors listed
[ ] Immediate actions completed
[ ] Short-term solutions planned
[ ] Long-term solutions identified
[ ] Solutions prioritized
[ ] RCA report written
[ ] Team debriefing scheduled
[ ] Action items assigned
[ ] Prevention measures planned
[ ] Follow-up scheduled
Key Points
- Distinguish symptom from root cause
- Use 5 Whys technique systematically
- Look for systemic issues, not individual blame
- Focus on prevention, not just fixing
- Document thoroughly for team learning
- Assign clear ownership for solutions
- Follow up to verify effectiveness
- Use RCA to drive improvements