| name | incident-response |
| description | Incident management, on-call procedures, and runbook execution. |
Incident Response
Severity Levels
| Level | Description | Response Time |
|-------|-------------|---------------|
| P1 | Service down | 15 min |
| P2 | Major degradation | 30 min |
| P3 | Minor impact | 4 hours |
| P4 | No impact | Next business day |
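The table above can be encoded as a small lookup so that tooling can check whether an acknowledgement met its target. This is a minimal sketch: the function names `response_target` and `ack_within_target` are hypothetical, and treating P4 as always on time is an assumption, since "next business day" has no fixed minute value.

```shell
# Map a severity level to its response-time target in minutes.
# P4 has no minute-based target, so a label is printed instead.
response_target() {
    case "$1" in
        P1) echo "15" ;;
        P2) echo "30" ;;
        P3) echo "240" ;;
        P4) echo "next-business-day" ;;
        *)  echo "unknown severity: $1" >&2; return 1 ;;
    esac
}

# Succeed if an acknowledgement after $2 minutes met the target for severity $1.
ack_within_target() {
    target=$(response_target "$1") || return 2
    # Assumption: any acknowledgement counts as on time for P4.
    if [ "$target" = "next-business-day" ]; then
        return 0
    fi
    [ "$2" -le "$target" ]
}
```

For example, `ack_within_target P1 12` succeeds, while `ack_within_target P1 16` returns a non-zero status.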
Incident Flow
Alert → Acknowledge → Assess → Mitigate → Resolve → Postmortem
│ │ │
└── Page ─────┴── Communicate
On-Call Checklist
- Acknowledge alert within SLA
- Assess impact and severity
- Communicate status to stakeholders
- Mitigate the impact: stop the bleeding
- Investigate root cause
- Resolve underlying issue
- Document in postmortem
Communication Template
🔴 INCIDENT: [Brief description]
Impact: [Who/what is affected]
Status: [Investigating/Mitigating/Resolved]
ETA: [Expected resolution time]
Updates: [Channel/page]
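Filling in the template by hand under pressure invites missing fields, so one option is a small helper that prints it from arguments. The sketch below assumes a hypothetical function name, `incident_update`, and rejects any status other than the three listed in the template.

```shell
# Print an incident update in the template's format.
# Arguments: description, impact, status, ETA, updates location.
incident_update() {
    case "$3" in
        Investigating|Mitigating|Resolved) ;;
        *) echo "status must be Investigating, Mitigating, or Resolved" >&2
           return 1 ;;
    esac
    printf '🔴 INCIDENT: %s\n' "$1"
    printf 'Impact: %s\n' "$2"
    printf 'Status: %s\n' "$3"
    printf 'ETA: %s\n' "$4"
    printf 'Updates: %s\n' "$5"
}
```

A call might look like `incident_update "Checkout errors" "Web checkout users" Mitigating "30 min" "#incident-channel"`, where every value is illustrative.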
Common Runbooks
High CPU
- Identify process: `top -c`
- Check for runaway processes
- Scale horizontally if needed
- Investigate root cause
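`top -c` is interactive, so a non-interactive filter can help when the first two steps need to be scripted or pasted into an incident channel. The sketch below is one possible approach: `cpu_hogs` is a hypothetical helper, and the 80 percent default is an assumed threshold, not a value from this runbook.

```shell
# List processes whose CPU share exceeds a threshold (default 80 percent).
# Reads "PID %CPU COMMAND" lines on stdin; the header line is skipped
# because "%CPU" evaluates to 0 when treated as a number.
# Only the first word of the command column is printed.
cpu_hogs() {
    awk -v limit="${1:-80}" '$2 + 0 > limit + 0 { print $1, $2, $3 }'
}

# Typical use:
#   ps -eo pid,pcpu,comm | cpu_hogs 80
```

Reading from stdin keeps the filter independent of the `ps` variant on the host, so the same function can be run against saved output during the postmortem.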
Out of Disk
- Check usage: `df -h`
- Find large files: `du -sh /* | sort -h`
- Clear logs/temp files
- Add storage or archive
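The usage check in the first step can be narrowed to only the filesystems that need attention. This is a sketch under stated assumptions: `full_filesystems` is a hypothetical helper, the 90 percent default is an assumed threshold, and mount points containing spaces are not handled.

```shell
# Print mount points at or above a usage threshold (default 90 percent).
# Reads `df -P` output on stdin; -P keeps each filesystem on a single line
# with the columns: filesystem, blocks, used, available, capacity, mount point.
full_filesystems() {
    awk -v limit="${1:-90}" 'NR > 1 {
        use = $5
        sub(/%/, "", use)          # strip the percent sign from the capacity column
        if (use + 0 >= limit + 0) print $6, $5
    }'
}

# Typical use:
#   df -P | full_filesystems 90
```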
Database Slow
- Check connections: `SHOW PROCESSLIST;`
- Identify slow queries
- Kill blocking queries if needed
- Scale or optimize
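To narrow the process list to likely culprits, the output can be filtered by how long each statement has been running. The sketch below assumes MySQL, whose `SHOW PROCESSLIST` columns are Id, User, Host, db, Command, Time, State, and Info, and assumes tab-separated batch output from the `mysql` client. The helper name `slow_queries` and the 30-second default are hypothetical.

```shell
# Print Id, Time, and Info for non-idle connections running at least N seconds.
# Reads tab-separated `SHOW FULL PROCESSLIST` output on stdin (default N = 30).
slow_queries() {
    awk -F'\t' -v limit="${1:-30}" '
        NR > 1 && $5 != "Sleep" && $6 + 0 >= limit + 0 { print $1, $6, $8 }'
}

# Typical use (connection options omitted):
#   mysql -B -e 'SHOW FULL PROCESSLIST' | slow_queries 30
```

The first column of each printed line is the connection Id, which is the value MySQL's `KILL` statement expects if a blocking query has to be terminated.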
Escalation Path
On-Call Engineer (15 min)
↓
Team Lead (30 min)
↓
Engineering Manager (1 hour)
↓
VP Engineering (2 hours)
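The path above can be expressed as a lookup from elapsed time to the role that should be engaged. This sketch reads each figure as the elapsed time at which that role is brought in, with the on-call engineer owning the incident before the 30-minute mark. That reading is an assumption, and `escalation_owner` is a hypothetical name.

```shell
# Print the highest role that should be engaged after $1 minutes.
# Assumption: each figure in the path is the elapsed time at which
# that role is brought in while the incident remains unresolved.
escalation_owner() {
    minutes=$1
    if [ "$minutes" -ge 120 ]; then
        echo "VP Engineering"
    elif [ "$minutes" -ge 60 ]; then
        echo "Engineering Manager"
    elif [ "$minutes" -ge 30 ]; then
        echo "Team Lead"
    else
        echo "On-Call Engineer"
    fi
}
```

For example, `escalation_owner 45` prints `Team Lead`.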