Claude Code Plugins

Community-maintained marketplace


Create operational runbooks, playbooks, standard operating procedures (SOPs), and incident response guides. Use when documenting operational procedures, on-call guides, or incident response processes.

Install Skill

1. Download the skill
2. Enable skills in Claude: open claude.ai/settings/capabilities and find the "Skills" section
3. Upload to Claude: click "Upload skill" and select the downloaded ZIP file
Note: Please verify the skill by reading through its instructions before using it.

SKILL.md

name: runbook-creation
description: Create operational runbooks, playbooks, standard operating procedures (SOPs), and incident response guides. Use when documenting operational procedures, on-call guides, or incident response processes.

# Runbook Creation

## Overview

Create comprehensive operational runbooks that provide step-by-step procedures for common operational tasks, incident response, and system maintenance.

## When to Use

  • Incident response procedures
  • Standard operating procedures (SOPs)
  • On-call playbooks
  • System maintenance guides
  • Disaster recovery procedures
  • Deployment runbooks
  • Escalation procedures
  • Service restoration guides

## Incident Response Runbook Template

# Incident Response Runbook

## Quick Reference

**Severity Levels:**
- P0 (Critical): Complete outage, data loss, security breach
- P1 (High): Major feature down, significant user impact
- P2 (Medium): Minor feature degradation, limited user impact
- P3 (Low): Cosmetic issues, minimal user impact

**Response Times:**
- P0: Immediate (24/7)
- P1: 15 minutes (business hours), 1 hour (after hours)
- P2: 4 hours (business hours)
- P3: Next business day

**Escalation Contacts:**
- On-call Engineer: PagerDuty rotation
- Engineering Manager: +1-555-0100
- VP Engineering: +1-555-0101
- CTO: +1-555-0102

## Table of Contents

1. [Service Down](#service-down)
2. [Database Issues](#database-issues)
3. [High CPU/Memory Usage](#high-cpu-memory-usage)
4. [API Performance Degradation](#api-performance-degradation)
5. [Security Incidents](#security-incidents)
6. [Data Loss Recovery](#data-loss-recovery)
7. [Rollback Procedures](#rollback-procedures)

---

## Service Down

### Symptoms
- Health check endpoint returning 500 errors
- Users unable to access application
- Load balancer showing all instances unhealthy
- Alerts: `service_down`, `health_check_failed`

### Severity: P0 (Critical)

### Initial Response (5 minutes)

1. **Acknowledge the incident**

   - Acknowledge the page in PagerDuty
   - Post in the #incidents Slack channel (a posting sketch follows these steps)

2. **Create incident channel**

   - Create Slack channel: `#incident-YYYY-MM-DD-service-down`
   - Post incident details and status updates there

3. **Assess impact**

   ```bash
   # Check service status
   kubectl get pods -n production

   # Check recent deployments
   kubectl rollout history deployment/api -n production

   # Check logs
   kubectl logs -f deployment/api -n production --tail=100
   ```
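For the Slack notification in step 1, a minimal sketch is shown below. It assumes an incoming-webhook URL is available in `SLACK_WEBHOOK_URL` (an illustrative variable name, not part of this runbook's tooling):

```bash
# Sketch: post the initial incident notice to Slack.
# SLACK_WEBHOOK_URL is an illustrative variable holding a Slack incoming-webhook URL.
channel="#incident-$(date -u +%Y-%m-%d)-service-down"
printf -v payload '{"text": "🚨 INCIDENT: Service Down | Severity: P0 | Status: Investigating | Channel: %s | Started: %s UTC"}' \
  "$channel" "$(date -u '+%Y-%m-%d %H:%M')"

curl -sS -X POST -H 'Content-Type: application/json' \
  -d "$payload" "$SLACK_WEBHOOK_URL"
```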
    

### Investigation Steps

#### Check Application Health

```bash
# 1. Check pod status
kubectl get pods -n production -l app=api

# Expected output: All pods Running
# NAME                   READY   STATUS    RESTARTS   AGE
# api-7d8c9f5b6d-4xk2p   1/1     Running   0          2h
# api-7d8c9f5b6d-7nm8r   1/1     Running   0          2h

# 2. Check pod logs for errors
kubectl logs -f deployment/api -n production --tail=100 | grep -i error

# 3. Check application endpoints
curl -v https://api.example.com/health
curl -v https://api.example.com/api/v1/status

# 4. Check database connectivity (from inside a pod)
kubectl exec -it deployment/api -n production -- sh
# then, inside the pod:
psql $DATABASE_URL -c "SELECT 1"
```

#### Check Infrastructure

```bash
# 1. Check load balancer target health
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:... \
  --query 'TargetHealthDescriptions[*].[Target.Id,TargetHealth.State]' \
  --output table

# 2. Check DNS resolution
dig api.example.com
nslookup api.example.com

# 3. Check SSL certificates
echo | openssl s_client -connect api.example.com:443 2>/dev/null | \
  openssl x509 -noout -dates

# 4. Check network connectivity to the database host
# (a TLS error is expected here; "Connected to" in the verbose output confirms reachability)
kubectl exec -it deployment/api -n production -- \
  curl -v https://database.example.com:5432
```

#### Check Database

```bash
# 1. Check database connections
psql $DATABASE_URL -c "SELECT count(*) FROM pg_stat_activity"

# 2. Check for locks
psql $DATABASE_URL -c "
  SELECT pid, usename, pg_blocking_pids(pid) as blocked_by, query
  FROM pg_stat_activity
  WHERE cardinality(pg_blocking_pids(pid)) > 0
"

# 3. Check database size
psql $DATABASE_URL -c "
  SELECT pg_size_pretty(pg_database_size(current_database()))
"

# 4. Check long-running queries
psql $DATABASE_URL -c "
  SELECT pid, now() - query_start as duration, query
  FROM pg_stat_activity
  WHERE state = 'active'
  ORDER BY duration DESC
  LIMIT 10
"
```

### Resolution Steps

#### Option 1: Restart Pods (Quick Fix)

```bash
# Restart all pods (rolling restart)
kubectl rollout restart deployment/api -n production

# Watch restart progress
kubectl rollout status deployment/api -n production

# Verify pods are healthy
kubectl get pods -n production -l app=api
```

#### Option 2: Scale Up (If Overloaded)

```bash
# Check current replicas
kubectl get deployment api -n production

# Scale up
kubectl scale deployment/api -n production --replicas=10

# Watch scaling
kubectl get pods -n production -l app=api -w
```

#### Option 3: Rollback (If Bad Deploy)

```bash
# Check deployment history
kubectl rollout history deployment/api -n production

# Rollback to previous version
kubectl rollout undo deployment/api -n production

# Rollback to specific revision
kubectl rollout undo deployment/api -n production --to-revision=5

# Verify rollback
kubectl rollout status deployment/api -n production
```

#### Option 4: Database Connection Reset

```bash
# If the database connection pool is exhausted, reload the app process
kubectl exec -it deployment/api -n production -- sh
# then, inside the pod:
kill -HUP 1  # Reload process, reset connections

# Or terminate idle connections held by the app
psql $DATABASE_URL -c "SELECT pg_terminate_backend(pid)
  FROM pg_stat_activity
  WHERE application_name = 'api'
  AND state = 'idle'"
```

### Verification

```bash
# 1. Check health endpoint
curl https://api.example.com/health
# Expected: {"status": "healthy"}

# 2. Check API endpoints
curl https://api.example.com/api/v1/users
# Expected: Valid JSON response

# 3. Check metrics
# Visit https://grafana.example.com
# Verify:
# - Error rate < 1%
# - Response time < 500ms
# - All pods healthy

# 4. Check logs for errors
kubectl logs deployment/api -n production --tail=100 | grep -i error
# Expected: No new errors
```
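If repeated manual checks are tedious, the health endpoint can be polled until it comes back (a small sketch using the endpoint from step 1 above; the retry count and interval are arbitrary):

```bash
# Sketch: poll the health endpoint until it returns HTTP 200 (max ~5 minutes).
for attempt in $(seq 1 30); do
  code=$(curl -s -o /dev/null -w '%{http_code}' https://api.example.com/health)
  if [ "$code" = "200" ]; then
    echo "Healthy after ${attempt} attempt(s)"
    break
  fi
  echo "Attempt ${attempt}: got HTTP ${code}, retrying in 10s"
  sleep 10
done
```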

### Communication

**Initial Update (within 5 minutes):**

```
🚨 INCIDENT: Service Down

Status: Investigating
Severity: P0
Impact: All users unable to access application
Start Time: 2025-01-15 14:30 UTC

We are investigating reports of users unable to access the application.
Our team is working to identify the root cause.

Next update in 15 minutes.
```

**Progress Update (every 15 minutes):**

```
🔍 UPDATE: Service Down

Status: Identified
Root Cause: Database connection pool exhausted
Action: Restarting application pods
ETA: 5 minutes

We have identified the issue and are implementing a fix.
```

**Resolution Update:**

```
✅ RESOLVED: Service Down

Status: Resolved
Resolution: Restarted application pods, reset database connections
Duration: 23 minutes

The service is now fully operational. We are monitoring closely
and will conduct a post-mortem to prevent future occurrences.
```
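If updates are posted to Slack, a small helper keeps them in the template format (a sketch; it reuses the illustrative `SLACK_WEBHOOK_URL` variable from the initial-response step):

```bash
# Sketch: post a progress update in the template format above.
# Arguments are plain text; avoid double quotes so the JSON stays valid.
post_update() {
  local status="$1" detail="$2"
  printf -v payload '{"text": "🔍 UPDATE: Service Down | Status: %s | %s"}' "$status" "$detail"
  curl -sS -X POST -H 'Content-Type: application/json' \
    -d "$payload" "$SLACK_WEBHOOK_URL"
}

post_update "Identified" "Root cause: DB connection pool exhausted. Restarting pods. ETA: 5 minutes."
```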

### Post-Incident

1. **Create post-mortem document** (skeleton sketch below)
   - Timeline of events
   - Root cause analysis
   - Action items to prevent recurrence
2. **Update monitoring**
   - Add alerts for this scenario
   - Improve detection time
3. **Update runbook**
   - Document any new findings
   - Add shortcuts for faster resolution
## Database Issues

### High Connection Count

**Symptoms:**

- Database rejecting new connections
- Error: "too many connections"
- Alert: `db_connections_high`

**Quick Fix:**

```bash
# 1. Check connection count by application
psql $DATABASE_URL -c "
  SELECT count(*), application_name
  FROM pg_stat_activity
  GROUP BY application_name
"

# 2. Kill connections idle for more than 10 minutes
psql $DATABASE_URL -c "
  SELECT pg_terminate_backend(pid)
  FROM pg_stat_activity
  WHERE state = 'idle'
  AND query_start < now() - interval '10 minutes'
"

# 3. Restart connection pools
kubectl rollout restart deployment/api -n production
```

### Slow Queries

**Symptoms:**

- API response times > 5 seconds
- Database CPU at 100%
- Alert: `slow_query_detected`

**Investigation:**

```sql
-- Find slow queries
SELECT
  pid,
  now() - query_start as duration,
  query
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY duration DESC
LIMIT 10;

-- Check for tables with many sequential scans (possible missing indexes)
SELECT
  schemaname,
  relname,
  seq_scan,
  seq_tup_read,
  idx_scan
FROM pg_stat_user_tables
WHERE seq_scan > 0
ORDER BY seq_scan DESC
LIMIT 10;

-- Kill long-running query (if needed)
SELECT pg_terminate_backend(12345);  -- Replace with actual PID
```
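To confirm that a slow statement is actually missing an index before changing anything, inspect its plan (a sketch; `orders` and `customer_id` are placeholder names):

```bash
# Sketch: examine the plan of a suspect query (table/column names are placeholders).
psql $DATABASE_URL -c "EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id = 42"

# If the plan shows a sequential scan over a large table, add an index.
# CONCURRENTLY builds it without blocking writes.
psql $DATABASE_URL -c "CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_orders_customer_id ON orders (customer_id)"
```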

## High CPU/Memory Usage

### Symptoms

- Pods being OOMKilled
- Response times increasing
- Alerts: `high_memory_usage`, `high_cpu_usage`

### Investigation

```bash
# 1. Check pod resources
kubectl top pods -n production

# 2. Check resource limits
kubectl describe pod <pod-name> -n production | grep -A 5 Limits

# 3. Check for memory leaks
kubectl logs deployment/api -n production | grep -i "out of memory"

# 4. Profile application (if needed)
kubectl exec -it <pod-name> -n production -- sh
# Run a profiler inside the pod: node --inspect, py-spy, etc.
```

### Resolution

```bash
# Option 1: Increase resources
kubectl set resources deployment/api -n production \
  --limits=cpu=2000m,memory=4Gi \
  --requests=cpu=1000m,memory=2Gi

# Option 2: Scale horizontally
kubectl scale deployment/api -n production --replicas=6

# Option 3: Restart problematic pods
kubectl delete pod <pod-name> -n production
```
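If CPU-driven overload keeps recurring, an autoscaler can take over Option 2 automatically (the threshold and replica bounds below are illustrative):

```bash
# Sketch: autoscale the deployment on CPU utilization (numbers are illustrative).
kubectl autoscale deployment/api -n production --cpu-percent=70 --min=3 --max=10

# Check autoscaler status
kubectl get hpa -n production
```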

## Rollback Procedures

### Application Rollback

```bash
# 1. List deployment history
kubectl rollout history deployment/api -n production

# 2. Check specific revision
kubectl rollout history deployment/api -n production --revision=5

# 3. Rollback to previous
kubectl rollout undo deployment/api -n production

# 4. Rollback to specific revision
kubectl rollout undo deployment/api -n production --to-revision=5

# 5. Verify rollback
kubectl rollout status deployment/api -n production
kubectl get pods -n production
```

### Database Rollback

```bash
# 1. Check migration status
npm run db:migrate:status

# 2. Rollback last migration
npm run db:migrate:undo

# 3. Rollback to specific migration (note the -- so the flag reaches the script)
npm run db:migrate:undo -- --to 20250115120000-migration-name

# 4. Verify database state
psql $DATABASE_URL -c "\dt"
```
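Before undoing a migration, it is prudent to snapshot the database so the rollback itself can be reversed (a sketch; the dump filename is illustrative):

```bash
# Sketch: back up the database before rolling back a migration (filename is illustrative).
pg_dump "$DATABASE_URL" --format=custom \
  --file="pre-rollback-$(date -u +%Y%m%dT%H%M%SZ).dump"

# If the rollback goes wrong, restore with:
# pg_restore --clean --dbname="$DATABASE_URL" pre-rollback-<timestamp>.dump
```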

Escalation Path

  1. Level 1 - On-call Engineer (You)

    • Initial response and investigation
    • Attempt standard fixes from runbook
  2. Level 2 - Senior Engineers

    • Escalate if not resolved in 30 minutes
    • Escalate if issue is complex/unclear
    • Contact via PagerDuty or Slack
  3. Level 3 - Engineering Manager

    • Escalate if not resolved in 1 hour
    • Escalate if cross-team coordination needed
  4. Level 4 - VP Engineering / CTO

    • Escalate for P0 incidents > 2 hours
    • Escalate for security breaches
    • Escalate for data loss
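Escalation is normally driven through PagerDuty; if a page has to be triggered by hand, an event can be sent through the Events API v2 (a sketch; `PAGERDUTY_ROUTING_KEY` is an illustrative variable holding the integration key for the escalation policy you need to reach):

```bash
# Sketch: manually trigger a page through the PagerDuty Events API v2.
curl -sS -X POST https://events.pagerduty.com/v2/enqueue \
  -H 'Content-Type: application/json' \
  -d @- <<EOF
{
  "routing_key": "${PAGERDUTY_ROUTING_KEY}",
  "event_action": "trigger",
  "payload": {
    "summary": "P0: service down, escalating per runbook",
    "source": "api.example.com",
    "severity": "critical"
  }
}
EOF
```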

## Useful Commands

```bash
# Kubernetes
kubectl get pods -n production
kubectl logs -f <pod-name> -n production
kubectl describe pod <pod-name> -n production
kubectl exec -it <pod-name> -n production -- sh
kubectl top pods -n production

# Database
psql $DATABASE_URL -c "SELECT version()"
psql $DATABASE_URL -c "SELECT * FROM pg_stat_activity"

# AWS
aws ecs list-tasks --cluster production
aws rds describe-db-instances
aws cloudwatch get-metric-statistics ...

# Monitoring URLs
# Grafana: https://grafana.example.com
# Datadog: https://app.datadoghq.com
# PagerDuty: https://example.pagerduty.com
# Status Page: https://status.example.com
```

## Best Practices

### ✅ DO
- Include quick reference section at top
- Provide exact commands to run
- Document expected outputs
- Include verification steps
- Add communication templates
- Define severity levels clearly
- Document escalation paths
- Include useful links and contacts
- Keep runbooks up-to-date
- Test runbooks regularly
- Include screenshots/diagrams
- Document common gotchas

### ❌ DON'T
- Use vague instructions
- Skip verification steps
- Forget to document prerequisites
- Assume knowledge of tools
- Skip communication guidelines
- Forget to update after incidents

## Resources

- [PagerDuty Incident Response](https://response.pagerduty.com/)
- [Google SRE Book](https://sre.google/books/)
- [Atlassian Incident Handbook](https://www.atlassian.com/incident-management/handbook)