Claude Code Plugins

Community-maintained marketplace


incident-runbook-generator

@patricio0312rev/skillset

Creates step-by-step incident response runbooks for common outages with actions, owners, rollback procedures, and communication templates. Use for "incident runbook", "outage response", "incident management", or "on-call procedures".

Install Skill

1. Download the skill
2. Enable skills in Claude: open claude.ai/settings/capabilities and find the "Skills" section
3. Upload to Claude: click "Upload skill" and select the downloaded ZIP file

Note: Please verify the skill by reviewing its instructions before using it.

SKILL.md

---
name: incident-runbook-generator
description: Creates step-by-step incident response runbooks for common outages with actions, owners, rollback procedures, and communication templates. Use for "incident runbook", "outage response", "incident management", or "on-call procedures".
---

# Incident Runbook Generator

Create actionable runbooks for common incidents.

## Runbook Template

# Runbook: Database Connection Pool Exhausted

**Severity:** P1 (Critical)
**Estimated Time to Resolve:** 15-30 minutes
**Owner:** Database Team (On-call)

## Symptoms

- Application errors: "connection pool exhausted"
- Increased API latency (>5s)
- Failed health checks
- CloudWatch alarm: `DatabaseConnectionsHigh`

## Detection

- Alert: DatabaseConnectionPoolExhausted
- Metrics: `active_connections > max_connections * 0.9`
- Logs: "Error: connect ETIMEDOUT"
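The metric rule above can be sketched as a small shell check. The `active` and `max` values here are illustrative stand-ins for a live `pg_stat_activity` count and the configured `max_connections`; in practice they would come from `psql -tAc`.

```shell
#!/bin/sh
# Sketch of the detection rule: alert when active connections exceed
# 90% of max_connections. Values are hypothetical stand-ins for a
# live `psql -tAc "SELECT count(*) FROM pg_stat_activity"` query.
active=185
max=200
threshold=$(( max * 9 / 10 ))   # integer form of max_connections * 0.9
if [ "$active" -gt "$threshold" ]; then
  echo "ALERT: connection pool near exhaustion ($active/$max)"
fi
```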

## Immediate Actions (5 min)

1. **Verify the issue**

   ```sql
   -- Check current connections
   SELECT count(*) FROM pg_stat_activity;
   ```

2. **Identify long-running queries**

   ```sql
   SELECT pid, now() - pg_stat_activity.query_start AS duration, query
   FROM pg_stat_activity
   WHERE state = 'active'
   ORDER BY duration DESC
   LIMIT 10;
   ```

3. **Kill blocking queries (if safe)**

   ```sql
   SELECT pg_terminate_backend(pid)
   FROM pg_stat_activity
   WHERE state = 'idle in transaction'
   AND now() - state_change > interval '5 minutes';
   ```

## Mitigation (10 min)

1. **Scale up connection pool (temporary)**

   ```bash
   # Update the RDS parameter group. Note: max_connections is a static
   # parameter in PostgreSQL, so the change takes effect only after the
   # next instance reboot.
   aws rds modify-db-parameter-group \
     --db-parameter-group-name prod-params \
     --parameters "ParameterName=max_connections,ParameterValue=200,ApplyMethod=pending-reboot"
   ```

2. **Restart application (if needed)**

   ```bash
   kubectl rollout restart deployment/api
   ```

3. **Monitor recovery**

   ```bash
   watch -n 5 'psql -c "SELECT count(*) FROM pg_stat_activity;"'
   ```

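A one-shot variant of the monitoring step can decide when the pool has recovered. Here `get_active_count` is a hypothetical stub standing in for the real `psql -tAc "SELECT count(*) FROM pg_stat_activity"` call, and the 80% threshold is an illustrative choice.

```shell
#!/bin/sh
# Sketch: declare recovery once active connections drop below 80% of
# max_connections. get_active_count is a stub for the real psql query.
max=200
get_active_count() { echo 150; }   # hypothetical stand-in value
active=$(get_active_count)
if [ "$active" -lt $(( max * 8 / 10 )) ]; then
  echo "recovered: $active/$max connections"
else
  echo "still degraded: $active/$max connections"
fi
```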
## Root Cause Investigation

Check for:

- Recent deployment (new code with connection leaks)
- Traffic spike (legitimate or DDoS)
- Slow queries holding connections
- Connection pool configuration too small
- Application not releasing connections
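One quick way to spot the "application not releasing connections" case is to filter a `pg_stat_activity` dump for long-idle transactions. The sample rows below are illustrative; a real run would pipe in `psql --tuples-only` output formatted as `pid|state|seconds_in_state`.

```shell
#!/bin/sh
# Sketch: flag idle-in-transaction sessions older than 5 minutes (300 s)
# from pid|state|seconds_in_state lines. Sample data is inline; in practice
# the lines would come from a psql --tuples-only query.
printf '%s\n' \
  '101|active|12' \
  '102|idle in transaction|620' \
  '103|idle in transaction|45' |
awk -F'|' '$2 == "idle in transaction" && $3 > 300 { print "leak suspect pid", $1 }'
# prints: leak suspect pid 102
```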

## Rollback Steps

If caused by deployment:

```bash
# Roll back to the previous version
kubectl rollout undo deployment/api

# Verify
kubectl rollout status deployment/api
```

## Communication Template

**Initial (within 5 min):**

```text
🚨 INCIDENT: Database connection pool exhausted
Status: Investigating
Impact: API errors and slowness
ETA: 15-30 min
Next update: 10 min
```

**Update (every 10 min):**

```text
UPDATE: Killed long-running queries
Status: Mitigating
Impact: Still degraded, improving
Actions: Scaling connection pool
Next update: 10 min
```

**Resolution:**

```text
✅ RESOLVED: Database connections normalized
Duration: 25 minutes
Root cause: Connection leak in v2.3.4
Fix: Rolled back to v2.3.3
Follow-up: Bug fix PR #1234
Postmortem: [link]
```

## Prevention

- Add connection pool metrics to dashboards
- Implement connection timeout (30s)
- Add connection leak detection in tests
- Set up pre-deployment load testing
- Review connection pool sizing
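The timeout and sizing items above might look like the following in a connection pooler's configuration. This is a sketch using PgBouncer-style settings; the values are illustrative and would need tuning for the actual workload.

```ini
; Sketch of pooler limits (PgBouncer-style; values are illustrative)
[pgbouncer]
max_client_conn = 400        ; hard cap on client connections
default_pool_size = 50       ; server connections per database/user pair
query_timeout = 30           ; seconds; matches the 30s timeout above
server_idle_timeout = 300    ; recycle idle server connections
```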

## Related Runbooks

- Database High CPU
- Slow Database Queries
- Application OOM

## Output Checklist

- [ ] Symptoms documented
- [ ] Detection criteria
- [ ] Step-by-step actions
- [ ] Owner assigned
- [ ] Rollback procedure
- [ ] Communication templates
- [ ] Prevention measures