name	incident-management
description	Handle production incidents effectively. Use when responding to outages, conducting post-mortems, or improving reliability. Covers incident response and blameless culture.
allowed-tools	Read, Write, Glob, Grep

Incident Management

Name: incident-management
Author: dralgorhythm

Incident Severity

Level	Impact	Response Time
SEV1	Complete outage	Immediate
SEV2	Major degradation	< 15 min
SEV3	Minor degradation	< 1 hour
SEV4	Low impact	Next business day
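The severity table can be encoded so that tooling computes a response deadline from the detection time. The following is a minimal sketch, not a prescribed implementation: the names `Severity`, `SEVERITIES`, and `response_deadline` are hypothetical, "Immediate" is modeled as a zero-length window, and "Next business day" is left as `None` because it is not a fixed duration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class Severity:
    level: str
    impact: str
    # None means the target is not a fixed duration ("Next business day").
    respond_within: Optional[timedelta]


# Values mirror the severity table; treating "Immediate" as zero is an assumption.
SEVERITIES = {
    "SEV1": Severity("SEV1", "Complete outage", timedelta(0)),
    "SEV2": Severity("SEV2", "Major degradation", timedelta(minutes=15)),
    "SEV3": Severity("SEV3", "Minor degradation", timedelta(hours=1)),
    "SEV4": Severity("SEV4", "Low impact", None),
}


def response_deadline(level: str, detected_at: datetime) -> Optional[datetime]:
    """Return the latest time a response should begin, or None for SEV4."""
    window = SEVERITIES[level].respond_within
    if window is None:
        return None  # handled by business-day scheduling, not modeled here
    return detected_at + window
```

For example, a SEV2 detected at 12:00 yields a deadline of 12:15 on the same day, while a SEV4 returns `None` and is left to normal scheduling.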

Incident Response

1. Detect

Monitoring alerts
Customer reports
Error logs

2. Triage

Assess severity
Assign incident commander
Create communication channel

3. Investigate

Check recent changes
Review logs and metrics
Identify root cause
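Checking recent changes usually means listing whatever was deployed or reconfigured shortly before the incident began. A rough sketch follows, assuming changes are available as simple records; the `Change` type, the `recent_changes` helper, and the 24-hour default lookback are illustrative assumptions rather than part of any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    # Hypothetical record of a deploy, config change, or feature-flag flip.
    description: str
    applied_at: datetime


def recent_changes(changes, incident_start, lookback=timedelta(hours=24)):
    """Return changes applied within `lookback` before the incident, newest first."""
    window_start = incident_start - lookback
    candidates = [
        c for c in changes if window_start <= c.applied_at <= incident_start
    ]
    return sorted(candidates, key=lambda c: c.applied_at, reverse=True)
```

Sorting newest first puts the change closest to the incident start at the top, which is often the first one worth inspecting.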

4. Mitigate

Apply quick fix
Roll back if needed
Communicate status

5. Resolve

Confirm fix
Monitor for recurrence
Close incident

6. Learn

Post-mortem meeting
Document findings
Create action items
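The six phases above can be tracked as a simple ordered state machine that also records when each phase began, which gives the post-mortem a ready-made timeline. This is a simplified sketch under stated assumptions: `Phase` and `Incident` are hypothetical names, and the model is strictly linear, whereas real incidents may loop between investigating and mitigating.

```python
from datetime import datetime, timezone
from enum import Enum


class Phase(Enum):
    DETECT = 1
    TRIAGE = 2
    INVESTIGATE = 3
    MITIGATE = 4
    RESOLVE = 5
    LEARN = 6


class Incident:
    def __init__(self, title: str):
        self.title = title
        self.phase = Phase.DETECT
        # List of (timestamp, phase) entries, reusable in the post-mortem.
        self.timeline = [(datetime.now(timezone.utc), Phase.DETECT)]

    def advance(self) -> Phase:
        """Move to the next phase in order and record when it happened."""
        if self.phase is Phase.LEARN:
            raise ValueError("incident is already in the final phase")
        self.phase = Phase(self.phase.value + 1)
        self.timeline.append((datetime.now(timezone.utc), self.phase))
        return self.phase
```

Calling `advance()` five times on a new `Incident` walks it from `DETECT` to `LEARN`; a sixth call raises `ValueError`.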

Post-Mortem Template

# Post-Mortem: [Incident Title]

## Summary
[Brief description of what happened]

## Timeline
- HH:MM - [Event]
- HH:MM - [Event]
- HH:MM - [Resolution]

## Impact
- Duration: [X hours]
- Users affected: [X]
- Revenue impact: [if applicable]

## Root Cause
[What caused this incident]

## Contributing Factors
- [Factor 1]
- [Factor 2]

## What Went Well
- [Positive 1]
- [Positive 2]

## What Could Be Improved
- [Improvement 1]
- [Improvement 2]

## Action Items
- [ ] [Action 1] - Owner: [Name]
- [ ] [Action 2] - Owner: [Name]
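The Duration field in the Impact section can be derived from the Timeline entries instead of being estimated by hand. Here is a small sketch, assuming all entries use the template's HH:MM format and fall on the same day; `incident_duration` and `format_duration` are hypothetical helper names, and the sample events are made up for illustration.

```python
from datetime import datetime


def incident_duration(timeline):
    """Return the time between the earliest and latest timeline entries.

    `timeline` is a list of ("HH:MM", event) pairs on the same day,
    matching the Timeline format in the template.
    """
    times = [datetime.strptime(stamp, "%H:%M") for stamp, _ in timeline]
    return max(times) - min(times)


def format_duration(delta):
    """Format a timedelta as hours and minutes, e.g. '1h 30m'."""
    total_minutes = int(delta.total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes:02d}m"


# Illustrative, made-up timeline.
timeline = [
    ("14:05", "Alert fired"),
    ("14:20", "Rollback started"),
    ("15:35", "Resolution"),
]
print(format_duration(incident_duration(timeline)))  # prints "1h 30m"
```

Incidents that cross midnight would need full dates in the timeline, which this sketch does not handle.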

Blameless Culture

Focus on systems, not people
"What failed?" not "Who failed?"
Share learnings openly
Celebrate near-misses
