name	runbooks-structure
description	Use when creating structured operational runbooks for human operators. Covers runbook organization, documentation patterns, and best practices for clear operational procedures.
allowed-tools	Read, Write, Edit, Bash, Grep, Glob

# Runbooks - Structure

Creating clear, actionable runbooks for operational tasks, maintenance, and troubleshooting.

## What is a Runbook?

A runbook is step-by-step documentation for operational tasks:

- **Troubleshooting**: Diagnosing and fixing issues
- **Incident Response**: Handling production incidents
- **Maintenance**: Routine operational tasks
- **On-Call**: Reference for on-call engineers

## Basic Runbook Structure

### Minimum Viable Runbook

```markdown
# Service Name: Task/Issue

## Overview
Brief description of what this runbook covers.

## Prerequisites
- Required access/permissions
- Tools needed
- Knowledge required

## Steps

### 1. First Step
Detailed instructions for first action.

### 2. Second Step
Detailed instructions for second action.

## Validation
How to verify the task was completed successfully.

## Rollback (if applicable)
How to undo the changes if needed.
```
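
One way to keep runbooks from drifting away from the minimum structure is to lint them. The following is a minimal sketch, not an established tool: `missing_sections` is a hypothetical helper name, and the list of required headings is an assumption taken from the template (Rollback is left out because the template marks it "if applicable").

```bash
#!/usr/bin/env bash
# Hypothetical helper: print each required "## " section that a runbook lacks.
# The required names are an assumption based on the minimum viable template.
missing_sections() {
  local file="$1" section
  for section in Overview Prerequisites Steps Validation; do
    # -q: quiet, -x: match the whole line, -F: fixed string (no regex)
    grep -qxF "## ${section}" "$file" || echo "$section"
  done
}
```

For example, `missing_sections services/api-gateway/high-latency.md` prints nothing when all four sections are present, and one section name per line otherwise.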

### Comprehensive Runbook Template

````markdown
# [Service]: [Task/Issue Title]

**Last Updated:** 2025-01-15
**Owner:** Platform Team
**Severity:** High/Medium/Low
**Estimated Time:** 15 minutes

## Overview

Brief description of the problem or task this runbook addresses.

## When to Use This Runbook

- Alert fired: `high_cpu_usage`
- Customer report: slow response times
- Scheduled maintenance window

## Prerequisites

- [ ] VPN access to production network
- [ ] AWS console access (read/write)
- [ ] kubectl configured for production cluster
- [ ] Slack access to #incidents channel

## Context

### Architecture Overview
Brief explanation of relevant system architecture.

### Common Causes
- Database connection pool exhaustion
- Memory leaks in worker processes
- Third-party API rate limiting

## Diagnosis Steps

### 1. Check System Health

```bash
# Check pod status
kubectl get pods -n production

# Expected output: All pods Running
```

**Decision Point:** If pods are in `CrashLoopBackOff`, proceed to step 2. Otherwise, skip to step 3.

### 2. Check Application Logs

```bash
# View recent logs
kubectl logs -n production deployment/api-server --tail=100
```

Look for:

- Error messages containing "connection refused"
- Stack traces
- Repeated warnings

### 3. Check Database Performance

```bash
# Query RDS connection counts for the last hour (GNU date syntax)
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name DatabaseConnections \
  --dimensions Name=DBInstanceIdentifier,Value=prod-db \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 \
  --statistics Average
```

## Resolution Steps

### Option A: Restart Services (Quick Fix)

1. **Delete the affected pod** (the Deployment creates a replacement; `kubectl drain` applies to nodes, not pods):

   ```bash
   kubectl delete pod api-server-abc123 -n production
   ```

2. **Verify new pod is healthy:**

   ```bash
   kubectl get pods -n production | grep api-server
   ```

3. **Monitor for 5 minutes:** Check Datadog dashboard for error rates.

### Option B: Scale Resources (Long-term Fix)

1. **Increase database connection pool:**

   ```bash
   # Edit configmap
   kubectl edit configmap app-config -n production

   # Change: DB_POOL_SIZE: "10"
   # To: DB_POOL_SIZE: "20"
   ```

2. **Restart deployment:**

   ```bash
   kubectl rollout restart deployment/api-server -n production
   ```

3. **Monitor rollout:**

   ```bash
   kubectl rollout status deployment/api-server -n production
   ```

## Validation

- [ ] All pods in Running state: `kubectl get pods -n production`
- [ ] Error rate < 1%: Check Datadog dashboard
- [ ] Response time < 200ms p95: Check Datadog dashboard
- [ ] No alerts firing: Check PagerDuty

## Rollback

If the fix causes issues:

```bash
# Rollback to previous version
kubectl rollout undo deployment/api-server -n production

# Verify rollback
kubectl rollout status deployment/api-server -n production
```

## Follow-up Actions

- [ ] Create post-mortem ticket: JIRA-1234
- [ ] Update monitoring alerts if needed
- [ ] Schedule post-incident review
- [ ] Document learnings in team wiki

## Related Runbooks

## Contact Information

- **Primary On-Call:** Check PagerDuty schedule
- **Escalation:** Platform Team Lead
- **Slack Channel:** #platform-incidents

## Revision History

| Date | Author | Changes |
|------|--------|---------|
| 2025-01-15 | Alice | Added validation steps |
| 2025-01-10 | Bob | Initial version |
````


## Runbook Organization

### Directory Structure

```
runbooks/
├── README.md                    # Index of all runbooks
├── templates/
│   ├── troubleshooting.md       # Template for troubleshooting
│   ├── maintenance.md           # Template for maintenance tasks
│   └── incident-response.md     # Template for incidents
├── services/
│   ├── api-gateway/
│   │   ├── high-latency.md
│   │   ├── connection-errors.md
│   │   └── scaling.md
│   └── database/
│       ├── slow-queries.md
│       └── replication-lag.md
├── infrastructure/
│   ├── kubernetes/
│   │   ├── pod-restart.md
│   │   └── node-drain.md
│   └── networking/
│       └── firewall-rules.md
└── on-call/
    ├── first-responder.md
    ├── escalation-guide.md
    └── common-alerts.md
```
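
With a layout like this, the index in `README.md` can be generated rather than maintained by hand. A minimal sketch, assuming the index is a flat Markdown list of links and that `README.md` and everything under `templates/` should be excluded; `list_runbooks` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print every runbook under a root directory as a
# Markdown list item, skipping README.md and the templates/ directory.
list_runbooks() {
  local root="${1:-runbooks}"
  (
    cd "$root" || exit 1
    find . -type f -name '*.md' ! -name 'README.md' ! -path './templates/*' \
      | sed 's|^\./||' \
      | LC_ALL=C sort \
      | while IFS= read -r path; do
          # "--" stops printf from treating the leading "-" as an option
          printf -- '- [%s](%s)\n' "$path" "$path"
        done
  )
}
```

Running `list_runbooks runbooks` prints one link per runbook in sorted order, which can be pasted or redirected into the index.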


## Best Practices

### Write for Clarity

````markdown
# Good: Clear, specific steps
## Restart the API Server

1. Check current pod status:
   ```bash
   kubectl get pods -n production -l app=api-server
   ```

2. Delete the pod (it will auto-restart):
   ```bash
   kubectl delete pod <pod-name> -n production
   ```

3. Verify new pod is running:
   ```bash
   kubectl get pods -n production -l app=api-server
   ```
   Expected: STATUS = Running
````


```markdown
# Bad: Vague, assumes knowledge
## Fix API Issues

1. Restart the thing
2. Check if it works
3. If not, try something else
```

### Include Decision Trees

````markdown
## Diagnosis

1. Check if service is responding:
   ```bash
   curl -f https://api.example.com/health
   ```
   - If succeeds: Service is up. Check application logs (Step 3).
   - If fails: Service is down. Check pod status (Step 2).

2. Check pod status:
   ```bash
   kubectl get pods -n production
   ```
   - If CrashLoopBackOff: Check logs for errors (Step 4).
   - If Pending: Check node resources (Step 5).
   - If Running: Service should be up. Check DNS (Step 6).
````
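
A decision tree written this explicitly can also be expressed as a lookup, which is a quick way to confirm that every branch leads somewhere. A minimal sketch: `next_step_for_pod_status` is a hypothetical name, the step numbers mirror the example tree, and the fallback branch for unlisted statuses is an assumption that the example itself does not cover.

```bash
#!/usr/bin/env bash
# Sketch: map a pod STATUS value to the next runbook step.
# Step numbers follow the example decision tree; adjust to your runbook.
next_step_for_pod_status() {
  case "$1" in
    CrashLoopBackOff) echo "Step 4: check logs for errors" ;;
    Pending)          echo "Step 5: check node resources" ;;
    Running)          echo "Step 6: check DNS" ;;
    # Assumption: anything not listed in the tree goes to a human.
    *)                echo "Unknown status '$1': escalate to on-call" ;;
  esac
}
```

Writing the catch-all branch is a useful prompt to ask whether the runbook itself says what to do when none of the listed conditions apply.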


### Provide Expected Outputs

````markdown
## Check Database Connection

```bash
psql -h prod-db.example.com -U app -c "SELECT 1"
```

Expected output:

```
 ?column?
----------
        1
(1 row)
```

If you see "connection refused":

- Check security groups allow traffic
- Verify database is running
- Check credentials in secrets manager
````
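
When a step's expected output is short and exact, the comparison can be scripted so the operator sees a clear match or mismatch. A minimal sketch, assuming an exact string comparison is appropriate; `expect_output` is a hypothetical helper, not part of any tool named here.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run a command and compare its combined output with
# the output the runbook says to expect.
# Usage: expect_output "<expected text>" command [args...]
expect_output() {
  local expected="$1"; shift
  local actual
  actual="$("$@" 2>&1)"
  if [ "$actual" = "$expected" ]; then
    echo "OK"
  else
    echo "MISMATCH"
    echo "expected: $expected"
    echo "actual:   $actual"
    return 1
  fi
}
```

For the database check, `psql` can be run with `-At` (unaligned, tuples only) so the expected text is just `1`: `expect_output "1" psql -h prod-db.example.com -U app -At -c "SELECT 1"`.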


### Use Checklists

```markdown
## Pre-Deployment Checklist

- [ ] Code reviewed and approved
- [ ] Tests passing in CI
- [ ] Database migrations tested
- [ ] Rollback plan documented
- [ ] Monitoring alerts configured
- [ ] On-call engineer notified
- [ ] Deploy window scheduled
```

## Common Patterns

### Emergency Response

````markdown
# EMERGENCY: Database Down

**Time-Sensitive:** Complete within 5 minutes

## Immediate Actions (Do First)

1. **Page database team:**
   Use PagerDuty incident: "Database outage"

2. **Notify stakeholders:**
   Post in #incidents: "Database is down. Investigating."

3. **Enable maintenance mode:**
   ```bash
   kubectl set env deployment/api-server MAINTENANCE_MODE=true
   ```

## Investigation (While Waiting)

- Check RDS console for instance status
- Review CloudWatch logs for errors
- Check recent changes in deploy history

## Communication Template

Update every 10 minutes in #incidents:

```
[HH:MM] Database still down. Database team investigating. Impact: All API requests failing. ETA: Unknown.
```
````
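
During an incident it helps if every update has the same shape. A minimal sketch of a formatter for the communication template: `status_update` is a hypothetical name, and using UTC for the timestamp is an assumption (the template only says `HH:MM`). The time is an optional argument so the function can be tested with a fixed value.

```bash
#!/usr/bin/env bash
# Sketch: format an incident update in the shape of the communication template.
# Arguments: status text, impact text, ETA (default "Unknown"),
# time as HH:MM (default: current UTC time, which is an assumption).
status_update() {
  local status="$1" impact="$2" eta="${3:-Unknown}" now="${4:-$(date -u +%H:%M)}"
  printf '[%s] %s Impact: %s ETA: %s.\n' "$now" "$status" "$impact" "$eta"
}
```

For example, `status_update "Database still down. Database team investigating." "All API requests failing."` produces a line ready to paste into the incident channel.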


### Routine Maintenance

````markdown
# Monthly Database Maintenance

**Schedule:** First Sunday of each month, 2 AM UTC
**Duration:** ~2 hours
**Downtime:** None (rolling maintenance)

## Pre-Maintenance

**1 week before:**
- [ ] Announce maintenance window in #engineering
- [ ] Create calendar event
- [ ] Prepare rollback plan

**1 day before:**
- [ ] Verify backup is recent (< 24 hours)
- [ ] Test backup restoration (staging)
- [ ] Confirm on-call coverage

## Maintenance Steps

1. **Take snapshot:**
   ```bash
   aws rds create-db-snapshot \
     --db-instance-identifier prod-db \
     --db-snapshot-identifier maintenance-$(date +%Y%m%d)
   ```

2. **Apply updates:** Follow RDS console update wizard

3. **Monitor reboot:** Watch CloudWatch metrics for 15 minutes

## Post-Maintenance

- [ ] Verify all services healthy
- [ ] Post completion in #engineering
- [ ] Update maintenance log
````
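
The "backup is recent (< 24 hours)" item is easy to get wrong by eye, so it is a reasonable candidate for a scripted check. A minimal sketch, assuming the backup's creation time is already available as a Unix timestamp; how that timestamp is obtained depends on your tooling and is not shown. `backup_is_recent` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Sketch: succeed only if a backup is less than 24 hours old.
# Both arguments are Unix timestamps in seconds; "now" defaults to the
# current time so that tests can pass a fixed value.
backup_is_recent() {
  local backup_epoch="$1" now_epoch="${2:-$(date +%s)}"
  local max_age=$((24 * 60 * 60))
  [ $((now_epoch - backup_epoch)) -lt "$max_age" ]
}
```

In a checklist script this might read `backup_is_recent "$snapshot_epoch" || echo "Backup is older than 24 hours"`, where `snapshot_epoch` is a placeholder for a value you supply.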


### Knowledge Transfer

````markdown
# New Engineer Onboarding: Deploy Process

## Learning Objectives

After completing this runbook, you will:
- Understand our deployment pipeline
- Know how to deploy to staging
- Know how to roll back a bad deploy

## Step 1: Understanding the Pipeline

Our deployment flow:

```
GitHub → CI (GitHub Actions) → ArgoCD → Kubernetes
```

**Exercise:** Review a recent deploy in ArgoCD

## Step 2: Deploying to Staging

**Shadowing:** Watch a senior engineer deploy first.

1. Create feature branch
2. Open pull request
3. Wait for CI to pass
4. Merge to main
5. Watch ArgoCD sync

**Practice:** Deploy your own change to staging.

## Step 3: Deploying to Production

**Requirements before production deploy:**
- [ ] Completed 3+ staging deploys
- [ ] Reviewed with team lead
- [ ] Read incident response runbook

**Hands-on:** Pair with team lead for first production deploy.

## Certification

- [ ] Completed 5 staging deploys
- [ ] Completed 1 production deploy (supervised)
- [ ] Can explain rollback procedure

**Certified by:** _________________ Date: _______
````

## Anti-Patterns

### Don't Leave Out Context

````markdown
## Bad: No context

### Fix the API

Run this command:
```bash
kubectl delete pod api-server-abc
```

## Good: Explain why

### Restart API Server

**When to use:** API returning 500 errors, logs show memory leak.

**Why this works:** Deletes pod, Kubernetes creates fresh pod with clean state.

```bash
kubectl delete pod api-server-abc
```

**What to expect:** ~30 seconds downtime while new pod starts.
````


### Don't Skip Validation Steps

````markdown
# Bad: No validation
## Deploy New Version

1. Update image tag
2. Apply changes
3. Done!

# Good: Include validation
## Deploy New Version

1. Update image tag in deployment.yaml
2. Apply changes:
   ```bash
   kubectl apply -f deployment.yaml
   ```
3. Verify rollout:
   ```bash
   kubectl rollout status deployment/api-server
   ```
   Wait for "successfully rolled out"
4. Check pod health:
   ```bash
   kubectl get pods -l app=api-server
   ```
   All pods should show STATUS: Running
5. Verify endpoints:
   ```bash
   curl https://api.example.com/health
   ```
   Should return 200 OK
````


### Don't Assume Knowledge

````markdown
# Bad: Uses jargon without explanation
## Fix Pod

Check the ingress and make sure the service mesh is working.

# Good: Explains terms
## Fix Pod Connectivity

1. **Check ingress** (the load balancer that routes traffic to pods):
   ```bash
   kubectl get ingress -n production
   ```

2. **Verify service mesh** (Istio, manages pod-to-pod communication):
   ```bash
   kubectl get pods -n istio-system
   ```
   All Istio control plane pods should be Running.
````


## Related Skills

- **troubleshooting-guides**: Creating diagnostic procedures
- **incident-response**: Handling production incidents
