| name | incident-response |
| description | Respond to production incidents systematically with triage, investigation, resolution, and post-mortem analysis to minimize downtime and prevent recurrence. Use when handling production outages, triaging incidents, investigating critical bugs, coordinating incident response, implementing hotfixes, conducting post-mortems, or establishing incident response procedures. |
Incident Response - Production Issue Management
When to use this skill
- Responding to production outages
- Triaging critical incidents
- Investigating high-severity bugs
- Coordinating incident response teams
- Implementing emergency hotfixes
- Conducting post-mortem analyses
- Establishing incident response procedures
- Communicating status during incidents
- Creating runbooks for common issues
- Implementing rollback strategies
- Documenting incident timelines
- Preventing incident recurrence
Incident Response Process
1. Detect
- Monitoring alerts
- User reports
- Automated checks (see the detection sketch after this list)
2. Triage
- Assess severity (P0-P4)
- Page on-call engineer
- Create incident channel
3. Mitigate
- Roll back to the last known good release
- Scale resources
- Apply hotfix
- Communicate status
4. Resolve
- Verify fix
- Monitor metrics
- Update status page
- Close incident
5. Post-mortem
- Timeline of events
- Root cause analysis
- Action items
- Follow-up tasks
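Detection can be partly automated. The sketch below covers the Detect step: it polls a health endpoint and raises an alert after several consecutive failures. The endpoint URL, failure threshold, and polling interval are illustrative assumptions, not values from any particular stack.

```python
# Minimal automated-check sketch for the Detect step.
# HEALTH_URL, FAILURE_THRESHOLD, and the interval are assumptions for this example.
import time
import urllib.request

HEALTH_URL = "https://example.internal/healthz"  # assumed endpoint
FAILURE_THRESHOLD = 3                            # consecutive failures before alerting

def check_once(url: str, timeout: float = 5.0) -> bool:
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, HTTPError, and socket timeouts
        return False

def watch(url: str, interval_s: float = 30.0) -> None:
    failures = 0
    while True:
        if check_once(url):
            failures = 0
        else:
            failures += 1
            if failures >= FAILURE_THRESHOLD:
                # A real setup would page via the on-call tool;
                # printing keeps the sketch self-contained.
                print(f"ALERT: {url} failed {failures} consecutive checks")
        time.sleep(interval_s)

if __name__ == "__main__":
    watch(HEALTH_URL)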
Severity Levels
- P0 (Critical): Complete outage, data loss
- P1 (High): Major feature broken, revenue impact
- P2 (Medium): Degraded performance, workaround exists
- P3 (Low): Minor bug, cosmetic issue
- P4 (Informational): Enhancement request
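To keep severity assignment consistent during triage, a small helper can map measurable impact onto these levels. The alert fields and thresholds below are illustrative assumptions; tune them to the service.

```python
# Sketch: map alert impact onto the P0-P4 scale above and decide whether to page.
# The Alert fields and thresholds are assumptions for illustration only.
from dataclasses import dataclass

@dataclass
class Alert:
    service: str
    error_rate: float         # fraction of failing requests, 0.0-1.0
    users_affected_pct: float  # share of users impacted, 0-100

def assess_severity(alert: Alert) -> str:
    if alert.error_rate >= 0.9:
        return "P0"  # effectively a complete outage
    if alert.users_affected_pct >= 25 or alert.error_rate >= 0.25:
        return "P1"  # major feature broken, likely revenue impact
    if alert.users_affected_pct >= 5:
        return "P2"  # degraded, workaround probably exists
    return "P3"

def should_page(severity: str) -> bool:
    # Page a human immediately for P0/P1; lower severities go to the backlog.
    return severity in ("P0", "P1")

if __name__ == "__main__":
    alert = Alert(service="checkout", error_rate=0.4, users_affected_pct=30)
    sev = assess_severity(alert)
    print(sev, "page on-call" if should_page(sev) else "file a ticket")
```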
Example Runbook
```markdown
High CPU Usage Runbook
Symptoms
- Server CPU > 90%
- Slow response times
- Request timeouts
Investigation
- Check top processes: `top`
- Check memory: `free -h`
- Check logs: `tail -f app.log`
Mitigation
- Scale horizontally: Add servers
- Restart service: `systemctl restart app`
- Rate limit: Enable aggressive rate limiting
Resolution
- Identify root cause (N+1 query, memory leak, etc.)
- Deploy fix
- Monitor for 1 hour
```
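The investigation steps in a runbook like this can be scripted so the on-call engineer captures the same data every time. Below is a rough sketch that assumes a Unix host with the procps `ps`; the saturation threshold is an illustrative assumption.

```python
# Sketch: automate the first investigation steps of the high-CPU runbook.
# Assumes a Unix host; the `ps` flags are the common procps ones.
import os
import subprocess

def cpu_pressure() -> float:
    """1-minute load average divided by core count (>1.0 suggests saturation)."""
    load1, _, _ = os.getloadavg()
    return load1 / (os.cpu_count() or 1)

def top_processes(n: int = 5) -> str:
    # Scriptable equivalent of eyeballing `top`, useful for the incident timeline.
    out = subprocess.run(
        ["ps", "-eo", "pid,comm,%cpu", "--sort=-%cpu"],
        capture_output=True, text=True, check=True,
    )
    return "\n".join(out.stdout.splitlines()[: n + 1])  # header plus top N rows

if __name__ == "__main__":
    pressure = cpu_pressure()
    print(f"CPU pressure: {pressure:.2f}")
    if pressure > 0.9:  # assumed threshold; tune per host
        print(top_processes())
```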
Communication Template
```
[INCIDENT] Service X degraded
Status: Investigating
Impact: 20% of users seeing slow load times
ETA: 30 minutes
Updates:
- 10:00 AM: Issue detected
- 10:05 AM: On-call paged, investigation started
- 10:15 AM: Root cause identified (database bottleneck)
- 10:30 AM: Fix deployed, monitoring
Next update: 11:00 AM
```
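To keep every update in the same shape, the template can be rendered from structured data. A minimal sketch, assuming hypothetical field names:

```python
# Sketch: render the status-update template above from structured data.
# Field names are illustrative, not a standard.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Incident:
    title: str
    status: str
    impact: str
    eta: str
    updates: List[Tuple[str, str]] = field(default_factory=list)  # (time, note)

    def render(self) -> str:
        lines = [
            f"[INCIDENT] {self.title}",
            f"Status: {self.status}",
            f"Impact: {self.impact}",
            f"ETA: {self.eta}",
            "Updates:",
        ]
        lines += [f"- {t}: {note}" for t, note in self.updates]
        return "\n".join(lines)

incident = Incident(
    title="Service X degraded",
    status="Investigating",
    impact="20% of users seeing slow load times",
    eta="30 minutes",
    updates=[("10:00 AM", "Issue detected"), ("10:05 AM", "On-call paged")],
)
print(incident.render())
```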