| name | runbooks-structure |
| description | Use when creating structured operational runbooks for human operators. Covers runbook organization, documentation patterns, and best practices for clear operational procedures. |
| allowed-tools | Read, Write, Edit, Bash, Grep, Glob |
Runbooks - Structure
Creating clear, actionable runbooks for operational tasks, maintenance, and troubleshooting.
What is a Runbook?
A runbook is step-by-step documentation for operational tasks:
- Troubleshooting - Diagnosing and fixing issues
- Incident Response - Handling production incidents
- Maintenance - Routine operational tasks
- On-Call - Reference for on-call engineers
Basic Runbook Structure
Minimum Viable Runbook
# Service Name: Task/Issue
## Overview
Brief description of what this runbook covers.
## Prerequisites
- Required access/permissions
- Tools needed
- Knowledge required
## Steps
### 1. First Step
Detailed instructions for first action.
### 2. Second Step
Detailed instructions for second action.
## Validation
How to verify the task was completed successfully.
## Rollback (if applicable)
How to undo the changes if needed.
Comprehensive Runbook Template
# [Service]: [Task/Issue Title]
**Last Updated:** 2025-01-15
**Owner:** Platform Team
**Severity:** High/Medium/Low
**Estimated Time:** 15 minutes
## Overview
Brief description of the problem or task this runbook addresses.
## When to Use This Runbook
- Alert fired: `high_cpu_usage`
- Customer report: slow response times
- Scheduled maintenance window
## Prerequisites
- [ ] VPN access to production network
- [ ] AWS console access (read/write)
- [ ] kubectl configured for production cluster
- [ ] Slack access to #incidents channel
## Context
### Architecture Overview
Brief explanation of relevant system architecture.
### Common Causes
- Database connection pool exhaustion
- Memory leaks in worker processes
- Third-party API rate limiting
## Diagnosis Steps
### 1. Check System Health
```bash
# Check pod status
kubectl get pods -n production
# Expected output: All pods Running
Decision Point: If pods are CrashLooping, proceed to step 2. Otherwise, skip to step 3.
2. Check Application Logs
# View recent logs
kubectl logs -n production deployment/api-server --tail=100
Look for:
- Error messages containing "connection refused"
- Stack traces
- Repeated warnings
3. Check Database Performance
# Connect to RDS metrics
aws cloudwatch get-metric-statistics \
--namespace AWS/RDS \
--metric-name DatabaseConnections \
--dimensions Name=DBInstanceIdentifier,Value=prod-db
Resolution Steps
Option A: Restart Services (Quick Fix)
Drain pod gracefully:
kubectl drain pod/api-server-abc123 --ignore-daemonsetsVerify new pod is healthy:
kubectl get pods -n production | grep api-serverMonitor for 5 minutes: Check Datadog dashboard for error rates.
Option B: Scale Resources (Long-term Fix)
Increase database connection pool:
# Edit configmap kubectl edit configmap app-config -n production # Change: DB_POOL_SIZE: "10" # To: DB_POOL_SIZE: "20"Restart deployment:
kubectl rollout restart deployment/api-server -n productionMonitor rollout:
kubectl rollout status deployment/api-server -n production
Validation
- All pods in Running state:
kubectl get pods -n production - Error rate < 1%: Check Datadog dashboard
- Response time < 200ms p95: Check Datadog dashboard
- No alerts firing: Check PagerDuty
Rollback
If the fix causes issues:
# Rollback to previous version
kubectl rollout undo deployment/api-server -n production
# Verify rollback
kubectl rollout status deployment/api-server -n production
Follow-up Actions
- Create post-mortem ticket: JIRA-1234
- Update monitoring alerts if needed
- Schedule post-incident review
- Document learnings in team wiki
Related Runbooks
Contact Information
- Primary On-Call: Check PagerDuty schedule
- Escalation: Platform Team Lead
- Slack Channel: #platform-incidents
Revision History
| Date | Author | Changes |
|---|---|---|
| 2025-01-15 | Alice | Added validation steps |
| 2025-01-10 | Bob | Initial version |
## Runbook Organization
### Directory Structure
runbooks/ ├── README.md # Index of all runbooks ├── templates/ │ ├── troubleshooting.md # Template for troubleshooting │ ├── maintenance.md # Template for maintenance tasks │ └── incident-response.md # Template for incidents ├── services/ │ ├── api-gateway/ │ │ ├── high-latency.md │ │ ├── connection-errors.md │ │ └── scaling.md │ └── database/ │ ├── slow-queries.md │ └── replication-lag.md ├── infrastructure/ │ ├── kubernetes/ │ │ ├── pod-restart.md │ │ └── node-drain.md │ └── networking/ │ └── firewall-rules.md └── on-call/ ├── first-responder.md ├── escalation-guide.md └── common-alerts.md
## Best Practices
### Write for Clarity
```markdown
# Good: Clear, specific steps
## Restart the API Server
1. Check current pod status:
```bash
kubectl get pods -n production -l app=api-server
Delete the pod (it will auto-restart):
kubectl delete pod <pod-name> -n productionVerify new pod is running:
kubectl get pods -n production -l app=api-serverExpected: STATUS = Running
```markdown
# Bad: Vague, assumes knowledge
## Fix API Issues
1. Restart the thing
2. Check if it works
3. If not, try something else
Include Decision Trees
## Diagnosis
1. Check if service is responding:
```bash
curl -f https://api.example.com/health
If succeeds: Service is up. Check application logs (Step 3). If fails: Service is down. Check pod status (Step 2).
Check pod status:
kubectl get pods -n productionIf CrashLooping: Check logs for errors (Step 4). If Pending: Check node resources (Step 5). If Running: Service should be up. Check DNS (Step 6).
### Provide Expected Outputs
```markdown
## Check Database Connection
```bash
psql -h prod-db.example.com -U app -c "SELECT 1"
Expected output:
?column?
----------
1
(1 row)
If you see "connection refused":
- Check security groups allow traffic
- Verify database is running
- Check credentials in secrets manager
### Use Checklists
```markdown
## Pre-Deployment Checklist
- [ ] Code reviewed and approved
- [ ] Tests passing in CI
- [ ] Database migrations tested
- [ ] Rollback plan documented
- [ ] Monitoring alerts configured
- [ ] On-call engineer notified
- [ ] Deploy window scheduled
Common Patterns
Emergency Response
# EMERGENCY: Database Down
**Time-Sensitive:** Complete within 5 minutes
## Immediate Actions (Do First)
1. **Page database team:**
Use PagerDuty incident: "Database outage"
2. **Notify stakeholders:**
Post in #incidents: "Database is down. Investigating."
3. **Enable maintenance mode:**
```bash
kubectl set env deployment/api-server MAINTENANCE_MODE=true
Investigation (While Waiting)
- Check RDS console for instance status
- Review CloudWatch logs for errors
- Check recent changes in deploy history
Communication Template
Update every 10 minutes in #incidents:
[HH:MM] Database still down. Database team investigating. Impact: All API requests failing. ETA: Unknown.
### Routine Maintenance
```markdown
# Monthly Database Maintenance
**Schedule:** First Sunday of each month, 2 AM UTC
**Duration:** ~2 hours
**Downtime:** None (rolling maintenance)
## Pre-Maintenance
**1 week before:**
- [ ] Announce maintenance window in #engineering
- [ ] Create calendar event
- [ ] Prepare rollback plan
**1 day before:**
- [ ] Verify backup is recent (< 24 hours)
- [ ] Test backup restoration (staging)
- [ ] Confirm on-call coverage
## Maintenance Steps
1. **Take snapshot:**
```bash
aws rds create-db-snapshot \
--db-instance-identifier prod-db \
--db-snapshot-identifier maintenance-$(date +%Y%m%d)
Apply updates: Follow RDS console update wizard
Monitor reboot: Watch CloudWatch metrics for 15 minutes
Post-Maintenance
- Verify all services healthy
- Post completion in #engineering
- Update maintenance log
### Knowledge Transfer
```markdown
# New Engineer Onboarding: Deploy Process
## Learning Objectives
After completing this runbook, you will:
- Understand our deployment pipeline
- Know how to deploy to staging
- Know how to rollback a bad deploy
## Step 1: Understanding the Pipeline
Our deployment flow:
GitHub → CI (GitHub Actions) → ArgoCD → Kubernetes
**Exercise:** Review a recent deploy in ArgoCD
## Step 2: Deploying to Staging
**Shadowing:** Watch a senior engineer deploy first.
1. Create feature branch
2. Open pull request
3. Wait for CI to pass
4. Merge to main
5. Watch ArgoCD sync
**Practice:** Deploy your own change to staging.
## Step 3: Deploying to Production
**Requirements before production deploy:**
- [ ] Completed 3+ staging deploys
- [ ] Reviewed with team lead
- [ ] Read incident response runbook
**Hands-on:** Pair with team lead for first production deploy.
## Certification
- [ ] Completed 5 staging deploys
- [ ] Completed 1 production deploy (supervised)
- [ ] Can explain rollback procedure
**Certified by:** _________________ Date: _______
Anti-Patterns
Don't Leave Out Context
## Bad: No context
### Fix the API
Run this command:
```bash
kubectl delete pod api-server-abc
Good: Explain why
Restart API Server
When to use: API returning 500 errors, logs show memory leak.
Why this works: Deletes pod, Kubernetes creates fresh pod with clean state.
kubectl delete pod api-server-abc
What to expect: ~30 seconds downtime while new pod starts.
### Don't Skip Validation Steps
```markdown
# Bad: No validation
## Deploy New Version
1. Update image tag
2. Apply changes
3. Done!
# Good: Include validation
## Deploy New Version
1. Update image tag in deployment.yaml
2. Apply changes:
```bash
kubectl apply -f deployment.yaml
Verify rollout:
kubectl rollout status deployment/api-serverWait for "successfully rolled out"
Check pod health:
kubectl get pods -l app=api-serverAll pods should show STATUS: Running
Verify endpoints:
curl https://api.example.com/healthShould return 200 OK
### Don't Assume Knowledge
```markdown
# Bad: Uses jargon without explanation
## Fix Pod
Check the ingress and make sure the service mesh is working.
# Good: Explains terms
## Fix Pod Connectivity
1. **Check ingress** (the load balancer that routes traffic to pods):
```bash
kubectl get ingress -n production
Verify service mesh (Istio, manages pod-to-pod communication):
kubectl get pods -n istio-systemAll Istio control plane pods should be Running.
## Related Skills
- **troubleshooting-guides**: Creating diagnostic procedures
- **incident-response**: Handling production incidents