---
name: Incident Response
description: Structured approach to handling production incidents, from detection through resolution and post-mortem analysis
---
# Incident Response
## Purpose

Achieve effective incident response through:
- Rapid incident detection
- Structured response process
- Clear communication
- Root cause analysis
- Prevention strategies
## When to Use
Invoke this skill when:
- Production outage occurs
- Performance degradation detected
- Security incident suspected
- Preparing incident runbooks
- Conducting post-mortems
## Incident Response Framework
### Incident Severity Levels
**SEV 1 (Critical)**:
- Complete service outage
- Data loss/corruption
- Security breach
- Revenue impact: High

**Response Time**: Immediate
**Team**: Full on-call rotation

**SEV 2 (High)**:
- Partial service degradation
- Major feature broken
- Affecting multiple customers
- Revenue impact: Medium

**Response Time**: <15 minutes
**Team**: Primary on-call + specialist

**SEV 3 (Medium)**:
- Minor feature issue
- Affecting some users
- Workaround available
- Revenue impact: Low

**Response Time**: <1 hour
**Team**: Primary on-call

**SEV 4 (Low)**:
- Cosmetic issues
- Single user affected
- No revenue impact

**Response Time**: Next business day
**Team**: During business hours
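These thresholds are easier to apply consistently when encoded. Below is a minimal sketch of the `calculateSeverity` helper referenced later in this skill; the cut-off values are illustrative and should be tuned per team:

```typescript
type Severity = 'SEV1' | 'SEV2' | 'SEV3' | 'SEV4';

interface ImpactSignals {
  usersAffected: number;
  systemsDown: string[]; // services currently failing health checks
  revenueImpact: number; // estimated loss in $/hour
}

function calculateSeverity(signals: ImpactSignals): Severity {
  // Full outage with meaningful revenue loss: all hands
  if (signals.systemsDown.length > 0 && signals.revenueImpact > 1_000) {
    return 'SEV1';
  }
  // Broad degradation across many users
  if (signals.usersAffected > 1_000) {
    return 'SEV2';
  }
  // Limited impact, typically with a workaround
  if (signals.usersAffected > 1) {
    return 'SEV3';
  }
  return 'SEV4';
}
```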
## Incident Response Process
### 1. Detection (Minutes 0-5)
**How Incidents Are Detected**:
- Automated alerts (monitoring)
- User reports (support tickets)
- Social media mentions
- Team members noticing issues
- Deployment gone wrong
**Initial Assessment**:

```typescript
interface Incident {
  severity: 'SEV1' | 'SEV2' | 'SEV3' | 'SEV4';
  description: string;
  affectedSystems: string[];
  userImpact: string;
  startTime: Date;
}

function assessIncident(alert: Alert): Incident {
  // Determine severity
  const severity = calculateSeverity({
    usersAffected: alert.affectedUsers,
    systemsDown: alert.failingSystems,
    revenueImpact: alert.estimatedRevenueLoss,
  });

  return {
    severity,
    description: alert.message,
    affectedSystems: alert.systems,
    userImpact: estimateUserImpact(alert),
    startTime: new Date(),
  };
}
```
### 2. Response (Minutes 5-10)
**Immediate Actions** (sketched in code after this list):
1. Acknowledge the incident
2. Determine severity
3. Page appropriate team
4. Create incident channel (#incident-YYYY-MM-DD-description)
5. Assign roles
6. Start incident log
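A sketch tying these six steps together; `pager`, `slack`, and `incidentLog` are hypothetical clients standing in for PagerDuty, Slack, and your incident tracker:

```typescript
// Sketch only: pager, slack, incidentLog, teamFor, isoDate, and slugify
// are hypothetical stand-ins for your own tooling.
async function declareIncident(incident: Incident): Promise<void> {
  await pager.acknowledge(incident);            // 1. Acknowledge
  await pager.page(teamFor(incident.severity)); // 2-3. Severity decides who is paged
  const channel = await slack.createChannel(
    `incident-${isoDate()}-${slugify(incident.description)}`, // 4. #incident-YYYY-MM-DD-...
  );
  await slack.post(channel, 'Claim roles: IC / Tech Lead / Comms / Scribe'); // 5. Assign roles
  await incidentLog.open(incident, { channel }); // 6. Start the incident log
}
```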
**Incident Roles**:
**Incident Commander (IC)**:
- Coordinates response
- Makes decisions
- Manages communication

**Technical Lead**:
- Drives technical investigation
- Implements fixes
- Coordinates engineers

**Communications Lead**:
- Updates stakeholders
- Posts status updates
- Manages customer communication

**Scribe**:
- Documents timeline
- Records decisions
- Maintains incident log
**Incident Channel Setup**:

```markdown
# Slack channel naming
#incident-2025-01-15-auth-service-down

# Pin critical information
- Severity: SEV 1
- Started: 2025-01-15 14:23 UTC
- Roles:
  - IC: @alice
  - Tech Lead: @bob
  - Comms: @charlie
  - Scribe: @dave
- Status Page: https://status.company.com
- War Room: https://zoom.us/incident-room
```
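If channel setup is automated, the Slack Web API covers creation, posting, and pinning. A sketch using `@slack/web-api`; the channel name and summary text are illustrative:

```typescript
import { WebClient } from '@slack/web-api';

const slack = new WebClient(process.env.SLACK_TOKEN);

async function openIncidentChannel(name: string, summary: string): Promise<string> {
  // Slack channel names must be lowercase, e.g. incident-2025-01-15-auth-service-down
  const { channel } = await slack.conversations.create({ name });
  // Post the critical information, then pin it so it stays visible
  const posted = await slack.chat.postMessage({ channel: channel!.id!, text: summary });
  await slack.pins.add({ channel: channel!.id!, timestamp: posted.ts! });
  return channel!.id!;
}
```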
### 3. Investigation (Minutes 10-30)
**Systematic Debugging**:

```typescript
class IncidentInvestigation {
  async investigate(incident: Incident): Promise<RootCause> {
    // 1. Check recent changes
    const recentDeploys = await getRecentDeployments({ hours: 2 });
    const recentConfigChanges = await getConfigChanges({ hours: 2 });

    // 2. Review metrics
    const metrics = await getMetrics({
      services: incident.affectedSystems,
      timeRange: { start: incident.startTime, end: new Date() },
    });

    // 3. Analyze logs
    const errors = await searchLogs({
      level: 'ERROR',
      services: incident.affectedSystems,
      since: incident.startTime,
    });

    // 4. Check dependencies
    const dependencyHealth = await checkDependencies(incident.affectedSystems);

    // 5. Form hypothesis
    const hypothesis = this.formHypothesis({
      recentChanges: [...recentDeploys, ...recentConfigChanges],
      metrics,
      errors,
      dependencyHealth,
    });

    return hypothesis;
  }

  formHypothesis(data: InvestigationData): RootCause {
    // Correlate recent changes, metrics, errors, and dependency health
    // to rank the most likely cause (team-specific logic).
    throw new Error('formHypothesis: implement per-team correlation logic');
  }
}
```
**Common Investigation Queries**:
```bash
# Check for recent deployments
kubectl rollout history deployment/auth-service

# View recent errors
grep "ERROR" /var/log/app.log | tail -100

# Check resource usage
kubectl top pods -n production

# Analyze traffic patterns (SQL)
SELECT status_code, COUNT(*) FROM requests
WHERE timestamp > NOW() - INTERVAL '1 hour'
GROUP BY status_code;

# Review configuration changes
git log --since="2 hours ago" -- config/

# Check database connections (SQL)
SELECT COUNT(*) FROM pg_stat_activity
WHERE state = 'active';
```
### 4. Mitigation (Minutes 30-60)
**Mitigation Strategies**:

**Immediate Fixes (Stop the Bleeding)**:

```typescript
// Illustrative calls: substitute your own deploy, flag, and infra tooling.

// 1. Roll back the most recent deployment
await rollback(previousVersion);

// 2. Disable the problematic feature
await featureFlags.disable('new-feature');

// 3. Scale up resources
await scaleService('auth-service', { replicas: 10 });

// 4. Fail over to the backup system
await failover('primary-db', 'backup-db');

// 5. Apply rate limiting
await applyRateLimit({ requests: 100, period: '1m' });

// 6. Enable a circuit breaker
await enableCircuitBreaker('external-api', { timeout: '5s' });
```
**Decision Tree**:

```text
Is there a recent deployment?
├─ Yes: Roll back immediately
└─ No: Continue investigation

Is a resource exhausted (CPU/memory)?
├─ Yes: Scale up immediately
└─ No: Continue investigation

Is an external dependency failing?
├─ Yes: Enable circuit breaker / use backup
└─ No: Continue investigation

Is the database overloaded?
├─ Yes: Enable read replicas / caching
└─ No: Continue investigation
```
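The tree translates naturally into an ordered series of checks where the first match wins. A sketch with hypothetical probe helpers over your telemetry:

```typescript
// Sketch: each probe is a hypothetical async check against monitoring data.
async function chooseMitigation(incident: Incident): Promise<string> {
  if (await hasRecentDeployment(incident.affectedSystems)) {
    return 'rollback';                // Recent deploy: roll back immediately
  }
  if (await isResourceExhausted(incident.affectedSystems)) {
    return 'scale-up';                // CPU/memory saturated: scale first
  }
  if (await isDependencyFailing(incident.affectedSystems)) {
    return 'circuit-breaker';         // Failing dependency: isolate or fail over
  }
  if (await isDatabaseOverloaded(incident.affectedSystems)) {
    return 'read-replicas-and-cache'; // Overloaded DB: shed read load
  }
  return 'continue-investigation';    // Nothing obvious: keep digging
}
```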
### 5. Communication
**Status Updates (every 15-30 minutes)**:

```markdown
# Initial Update
**Status**: Investigating
**Impact**: Auth service experiencing high error rates. Users may be unable to log in.
**Started**: 14:23 UTC
**Next Update**: 14:45 UTC

# Progress Update
**Status**: Mitigation in progress
**Impact**: Ongoing. We have identified the root cause and are deploying a fix.
**Actions**: Rolling back to previous version (v1.2.3)
**ETA**: Fix expected by 15:00 UTC
**Next Update**: 15:00 UTC

# Resolution Update
**Status**: Resolved
**Impact**: Auth service restored. All functionality operational.
**Resolution**: Rolled back deployment. Root cause: memory leak in v1.3.0.
**Next Steps**: Post-mortem scheduled for tomorrow 10:00
```
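Updates are easier to keep on cadence when generated from the incident record rather than written from scratch each time. A minimal formatter sketch; the field names are illustrative:

```typescript
interface StatusUpdate {
  status: 'Investigating' | 'Mitigation in progress' | 'Resolved';
  impact: string;
  nextUpdateUtc: string; // e.g. '14:45 UTC'
}

function formatStatusUpdate(update: StatusUpdate): string {
  // Renders the same markdown shape used in the examples above
  return [
    `**Status**: ${update.status}`,
    `**Impact**: ${update.impact}`,
    `**Next Update**: ${update.nextUpdateUtc}`,
  ].join('\n');
}
```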
**Communication Templates**:

```markdown
# SEV 1 - Initial (Immediate)
Subject: [SEV1] Auth Service Outage
Body:
We are experiencing a critical outage affecting user authentication.

Impact: Users unable to log in
Started: 14:23 UTC
Team: Actively investigating

We will provide updates every 15 minutes.

# SEV 1 - Resolution
Subject: [RESOLVED] Auth Service Outage
Body:
The auth service outage has been resolved.

Duration: 37 minutes (14:23 - 15:00 UTC)
Impact: ~10,000 users affected
Resolution: Rolled back to v1.2.3

A detailed post-mortem will follow within 48 hours.
We apologize for the disruption.
```
### 6. Resolution
**Verification Checklist** (a verification sketch follows the list):
- [ ] Metrics returned to normal
- [ ] Error rates back to baseline
- [ ] User reports stopped
- [ ] Synthetic monitoring passing
- [ ] Team confirms resolution
- [ ] Waited 30+ minutes (no recurrence)
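The checklist can back the `verifyResolution` call used in the code below. A sketch assuming hypothetical wrappers over your monitoring and support tooling:

```typescript
// Sketch: metricsClient, syntheticMonitors, and supportQueue are
// hypothetical stand-ins for your monitoring and support APIs.
async function verifyResolution(incidentId: string): Promise<boolean> {
  const checks = await Promise.all([
    metricsClient.isAtBaseline('error_rate'),  // error rates back to baseline
    metricsClient.isAtBaseline('latency_p95'), // latency recovered
    syntheticMonitors.allPassing(),            // synthetic checks green
    supportQueue.noNewReports(incidentId),     // user reports stopped
  ]);
  // Callers should still wait 30+ minutes before closing the incident
  return checks.every(Boolean);
}
```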
**Resolution Actions**:

```typescript
async function resolveIncident(incidentId: string): Promise<void> {
  const incident = await incidents.get(incidentId);

  // 1. Verify the fix before declaring victory
  const verified = await verifyResolution(incidentId);
  if (!verified) {
    throw new Error('Resolution not verified');
  }

  // 2. Update the status page
  await statusPage.update({
    status: 'operational',
    message: 'All systems operational',
  });

  // 3. Send resolution notifications
  await notify.sendResolution(incidentId, {
    channels: ['slack', 'email', 'status-page'],
  });

  // 4. Update the incident record
  await incidents.update(incidentId, {
    status: 'resolved',
    resolvedAt: new Date(),
    resolution: 'Rolled back to previous version',
  });

  // 5. Schedule the post-mortem within 48 hours
  await calendar.createEvent({
    title: `Post-Mortem: ${incidentId}`,
    time: new Date(Date.now() + 48 * 60 * 60 * 1000),
    attendees: incident.team,
  });
}
```
## Post-Incident Activities
### 1. Post-Mortem (Within 48 Hours)

**Template**:
```markdown
# Post-Mortem: Auth Service Outage (2025-01-15)

## Summary
On January 15, 2025, the auth service experienced a 37-minute outage affecting ~10,000 users.

## Timeline (UTC)
| Time  | Event |
|-------|-------|
| 14:20 | Deployment of v1.3.0 started |
| 14:23 | Error rates spiked to 45% |
| 14:25 | PagerDuty alert triggered |
| 14:27 | Incident declared SEV 1 |
| 14:30 | Investigation started |
| 14:42 | Root cause identified: memory leak |
| 14:45 | Rollback initiated |
| 15:00 | Service restored |

## Root Cause
A memory leak in v1.3.0 caused OOM crashes in auth service pods.
Specifically: unclosed database connections in the new auth flow.

## Impact
- Duration: 37 minutes
- Users affected: ~10,000
- Revenue impact: $5,000 (estimated)
- Customer support tickets: 47

## What Went Well
✅ Fast detection (3 minutes)
✅ Clear communication
✅ Swift rollback decision
✅ Excellent team coordination

## What Went Wrong
❌ Memory leak not caught in testing
❌ No gradual rollout (straight to 100%)
❌ Insufficient load testing
❌ Rollback took longer than expected

## Action Items
- [ ] Add memory leak detection to CI (@alice, Jan 20)
- [ ] Implement canary deployments (@bob, Jan 25)
- [ ] Improve load testing (@charlie, Jan 30)
- [ ] Automate rollback (@dave, Feb 5)
- [ ] Add connection pool monitoring (@eve, Jan 22)

## Lessons Learned
1. Always use canary deployments for auth changes
2. Monitor connection pools proactively
3. Load tests should match production traffic
```
### 2. Blameless Culture
**DO**:
✅ Focus on systems and processes
✅ Ask "How can we prevent this?"
✅ Celebrate quick response
✅ Learn from mistakes
✅ Assume good intentions

**DON'T**:
❌ Blame individuals
❌ Ask "Who caused this?"
❌ Punish for mistakes
❌ Hide problems
❌ Assume malice
**Example Phrasing**:
❌ "Bob deployed broken code"
✅ "Deployment process didn't catch memory leak"
❌ "Why didn't you test this?"
✅ "What testing would have caught this?"
❌ "This is your fault"
✅ "What can we learn from this?"
## Incident Prevention
### Proactive Measures
```typescript
// 1. Comprehensive monitoring
const monitors = [
  { metric: 'error_rate', threshold: '> 5%', action: 'alert' },
  { metric: 'latency_p95', threshold: '> 1000ms', action: 'alert' },
  { metric: 'memory_usage', threshold: '> 80%', action: 'alert' },
  { metric: 'disk_space', threshold: '> 90%', action: 'alert' },
];

// 2. Chaos engineering
async function chaosTest() {
  // Randomly kill pods
  await killRandomPod();

  // Inject latency
  await injectLatency({ service: 'api', latency: '500ms' });

  // Simulate dependency failure
  await simulateFailure({ service: 'database', duration: '5m' });

  // Measure system resilience
  const resilience = await measureResilience();
  return resilience;
}

// 3. Game days
// Scheduled incident simulations to practice response

// 4. Runbooks
// Documented procedures for common incidents
```
## Incident Runbooks
### Database Connection Exhaustion
```markdown
## Symptoms
- Error: "Too many connections"
- Slow queries
- Timeouts

## Quick Fix
1. Scale up the connection pool:
   kubectl set env deployment/api DB_POOL_SIZE=50
2. Terminate stuck connections:
   SELECT pg_terminate_backend(pid) FROM pg_stat_activity
   WHERE state = 'idle' AND state_change < NOW() - INTERVAL '1 hour';

## Investigation
- Check active connections:
  SELECT COUNT(*) FROM pg_stat_activity;
- Find long-running queries:
  SELECT pid, query, state_change FROM pg_stat_activity
  WHERE state != 'idle' ORDER BY state_change;

## Prevention
- Implement connection pooling (see sketch below)
- Set connection timeouts
- Monitor connection usage
```
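For the prevention items, connection pooling and timeouts can be configured directly in node-postgres. A sketch of `pg.Pool` settings; the limits are illustrative starting points, not recommendations:

```typescript
import { Pool } from 'pg';

// Illustrative limits: keep the pool well below Postgres max_connections,
// reap idle clients, and fail fast rather than queueing forever.
const pool = new Pool({
  max: 20,                        // upper bound on open connections
  idleTimeoutMillis: 30_000,      // close clients idle for 30 seconds
  connectionTimeoutMillis: 5_000, // error out if no client is free within 5s
});

pool.on('error', (err) => {
  // Surface pool-level failures to monitoring instead of failing silently
  console.error('Unexpected idle client error', err);
});
```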
## Output Format
When guiding incident response:
```markdown
## Incident Response: ${IncidentName}

**Severity**: ${level}

**Immediate Actions**:
1. ${action1}
2. ${action2}

**Investigation Steps**:
- ${step1}
- ${step2}

**Mitigation Options**:
- ${option1}
- ${option2}

**Communication Plan**:
- ${updateSchedule}
- ${stakeholders}
```
## Related Skills

- deployment-strategies: For safe deployments
- monitoring-setup: For early detection
- debugging-techniques: For root cause analysis
- communication-patterns: For stakeholder updates