| name | incident-management |
| description | Handle production incidents effectively. Use when responding to outages, conducting post-mortems, or improving reliability. Covers incident response and blameless culture. |
| allowed-tools | Read, Write, Glob, Grep |
Incident Management
Incident Severity
| Level | Impact | Response Time |
|-------|--------|---------------|
| SEV1 | Complete outage | Immediate |
| SEV2 | Major degradation | < 15 min |
| SEV3 | Minor degradation | < 1 hour |
| SEV4 | Low impact | Next business day |
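
The severity table can be encoded so that tooling checks response targets consistently. The following is a minimal sketch, not a definitive implementation: the names `Severity`, `RESPONSE_WINDOW`, and `is_overdue` are hypothetical. It assumes only SEV2 and SEV3 have a fixed time window, since "Immediate" and "Next business day" need paging and business-calendar logic that is out of scope here.

```python
from datetime import timedelta
from enum import Enum


class Severity(Enum):
    """Severity levels and their impact, taken from the table above."""
    SEV1 = "Complete outage"
    SEV2 = "Major degradation"
    SEV3 = "Minor degradation"
    SEV4 = "Low impact"


# Assumption: only SEV2 and SEV3 map to a fixed window.
# SEV1 ("Immediate") and SEV4 ("Next business day") are left out on purpose.
RESPONSE_WINDOW = {
    Severity.SEV2: timedelta(minutes=15),
    Severity.SEV3: timedelta(hours=1),
}


def is_overdue(severity, elapsed):
    """Return True if `elapsed` time since detection has used up the window."""
    window = RESPONSE_WINDOW.get(severity)
    if window is None:
        raise ValueError(f"{severity.name} has no fixed response window")
    # The table says "< 15 min", so reaching the window exactly counts as late.
    return elapsed >= window
```

For example, `is_overdue(Severity.SEV2, timedelta(minutes=14))` is `False`, while the same call with 15 minutes is `True`.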
Incident Response
1. Detect
- Monitoring alerts
- Customer reports
- Error logs
2. Triage
- Assess severity
- Assign incident commander
- Create communication channel
3. Investigate
- Check recent changes
- Review logs and metrics
- Identify root cause
4. Mitigate
- Apply quick fix
- Roll back if needed
- Communicate status
5. Resolve
- Confirm fix
- Monitor for recurrence
- Close incident
6. Learn
- Post-mortem meeting
- Document findings
- Create action items
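
The six phases above run in a fixed order, which can be modeled as a simple linear state machine. This is a sketch under stated assumptions: `Phase`, `Incident`, and `advance` are hypothetical names, and the model assumes phases are never skipped or revisited, which a real process might allow.

```python
from dataclasses import dataclass, field
from enum import Enum


class Phase(Enum):
    """The six response phases, numbered as in the list above."""
    DETECT = 1
    TRIAGE = 2
    INVESTIGATE = 3
    MITIGATE = 4
    RESOLVE = 5
    LEARN = 6


@dataclass
class Incident:
    title: str
    phase: Phase = Phase.DETECT
    history: list = field(default_factory=list)  # (phase, note) pairs

    def advance(self, note):
        """Record a note for the current phase, then move to the next one."""
        if self.phase is Phase.LEARN:
            raise ValueError("incident is already in the final phase")
        self.history.append((self.phase, note))
        self.phase = Phase(self.phase.value + 1)
        return self.phase
```

Requiring a note on every transition keeps a record that can later feed the post-mortem timeline.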
Post-Mortem Template
# Post-Mortem: [Incident Title]
## Summary
[Brief description of what happened]
## Timeline
- HH:MM - [Event]
- HH:MM - [Event]
- HH:MM - [Resolution]
## Impact
- Duration: [X hours]
- Users affected: [X]
- Revenue impact: [if applicable]
## Root Cause
[What caused this incident]
## Contributing Factors
- [Factor 1]
- [Factor 2]
## What Went Well
- [Positive 1]
- [Positive 2]
## What Could Be Improved
- [Improvement 1]
- [Improvement 2]
## Action Items
- [ ] [Action 1] - Owner: [Name]
- [ ] [Action 2] - Owner: [Name]
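
The `Duration` field in the template can be derived from the timeline entries rather than estimated by hand. A minimal sketch, assuming entries follow the `- HH:MM - [Event]` shape used in the template, are listed in order, and all fall on the same calendar day. The function name `incident_duration` and the sample entries are hypothetical.

```python
from datetime import datetime


def incident_duration(timeline):
    """Return the time between the first and last timeline entries.

    Each entry is expected to look like "- HH:MM - event text".
    Assumes ordered entries on a single calendar day.
    """
    times = []
    for line in timeline:
        # Drop the leading list marker, then keep the text before " - ".
        stamp = line.lstrip("- ").split(" - ", 1)[0].strip()
        times.append(datetime.strptime(stamp, "%H:%M"))
    return times[-1] - times[0]


# Hypothetical timeline for illustration only.
example = [
    "- 14:02 - Alert fired",
    "- 14:10 - Rollback started",
    "- 15:17 - Resolution",
]
duration = incident_duration(example)
```

Incidents that cross midnight would need full dates in the timeline, which this sketch does not handle.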
Blameless Culture
- Focus on systems, not people
- "What failed?" not "Who failed?"
- Share learnings openly
- Celebrate near-misses