| name | runbook-creation |
| description | Operational runbook templates for incident response and procedures |
| allowed-tools | Read, Glob, Grep, Write, Edit |
Runbook Creation Skill
When to Use This Skill
Use this skill when:
- Runbook creation tasks - Writing or updating operational runbooks for incident response and procedures
- Planning or design - Need guidance on runbook creation approaches
- Best practices - Want to follow established patterns and standards
Overview
Create operational runbooks for incident response, maintenance procedures, and operational tasks.
MANDATORY: Documentation-First Approach
Before creating runbooks:
- Invoke the `docs-management` skill for runbook patterns
- Verify SRE best practices via MCP servers (perplexity)
- Base guidance on Google SRE principles
Runbook Types
Runbook Categories:
┌─────────────────────────────────────────────────────────────────────────────┐
│ Incident Response Runbooks │
│ • Alert-triggered procedures │
│ • Escalation paths │
│ • Communication templates │
├─────────────────────────────────────────────────────────────────────────────┤
│ Operational Runbooks │
│ • Deployment procedures │
│ • Maintenance tasks │
│ • Backup/restore operations │
├─────────────────────────────────────────────────────────────────────────────┤
│ Troubleshooting Runbooks │
│ • Diagnostic procedures │
│ • Common issue resolution │
│ • Debug workflows │
├─────────────────────────────────────────────────────────────────────────────┤
│ Emergency Runbooks │
│ • Disaster recovery │
│ • Security incident response │
│ • Business continuity │
└─────────────────────────────────────────────────────────────────────────────┘
Standard Runbook Template
# Runbook: [TITLE]
| Property | Value |
|----------|-------|
| **ID** | RB-[NUMBER] |
| **Category** | [Incident/Operational/Troubleshooting/Emergency] |
| **Service** | [Service Name] |
| **Owner** | [Team/Individual] |
| **Last Updated** | [YYYY-MM-DD] |
| **Last Tested** | [YYYY-MM-DD] |
| **Review Frequency** | [Quarterly/Monthly/Annually] |
---
## Overview
**Purpose:** [What this runbook helps you accomplish]
**When to Use:** [Conditions that trigger this runbook]
**Expected Outcome:** [What success looks like]
**Estimated Duration:** [Time to complete]
---
## Prerequisites
### Required Access
- [ ] [System/Tool 1] - [Role/Permission needed]
- [ ] [System/Tool 2] - [Role/Permission needed]
### Required Knowledge
- [Skill/Knowledge 1]
- [Skill/Knowledge 2]
### Tools Needed
| Tool | Purpose | Access URL |
|------|---------|------------|
| [Tool 1] | [Purpose] | [URL/Link] |
| [Tool 2] | [Purpose] | [URL/Link] |
---
## Quick Reference
```text
Quick Commands:
┌────────────────────────────────────────────────────────────────┐
│ Check service status: kubectl get pods -n [namespace] │
│ View logs: kubectl logs -f [pod-name] -n [namespace] │
│ Restart service: kubectl rollout restart deployment/[name] │
│ Check metrics: [monitoring-url] │
└────────────────────────────────────────────────────────────────┘
```
Procedure
Step 1: [Step Name]
Objective: [What this step accomplishes]
Actions:
- [Action 1], for example `kubectl get pods -n production`
- [Action 2]
Expected Result: [What you should see]
If This Fails: Go to Troubleshooting Section
Step 2: [Step Name]
Objective: [What this step accomplishes]
Actions:
- [Action 1]
- [Action 2]
Decision Point:
┌─────────────────────────────────────┐
│ Is the service responding? │
│ │
│ YES → Continue to Step 3 │
│ NO → Go to Escalation │
└─────────────────────────────────────┘
Step 3: [Verification]
Objective: Verify the issue is resolved
Verification Checklist:
- Service is responding to health checks
- Metrics show normal values
- No new errors in logs
- Users can access the service
Troubleshooting
Issue: [Common Issue 1]
Symptoms: [What you observe]
Cause: [Root cause]
Resolution:
- [Step 1]
- [Step 2]
Issue: [Common Issue 2]
Symptoms: [What you observe]
Cause: [Root cause]
Resolution:
- [Step 1]
- [Step 2]
Escalation
When to Escalate
- Issue not resolved after [X] minutes
- Impact affects [threshold]
- Required access not available
- Unsure of next steps
Escalation Path
| Level | Contact | Method | Response Time |
|---|---|---|---|
| L1 | On-call Engineer | PagerDuty | 15 min |
| L2 | Team Lead | Slack #incidents | 30 min |
| L3 | Engineering Manager | Phone | 1 hour |
| L4 | VP Engineering | Phone | As needed |
Communication
Status Updates
Template:
[TIMESTAMP] - [SERVICE] - [STATUS]
Current Status: [Investigating/Identified/Monitoring/Resolved]
Impact: [Description of user impact]
Next Update: [Time of next update]
Actions Taken:
- [Action 1]
- [Action 2]
Next Steps:
- [Planned action]
Stakeholder Notification
| Stakeholder | When to Notify | Method |
|---|---|---|
| Engineering | Immediately | Slack |
| Product | If user-impacting | Slack |
| Support | If customer-facing | |
| Leadership | If SEV1/SEV2 | Phone |
Post-Incident
Cleanup Tasks
- Remove any temporary fixes
- Update monitoring/alerts if needed
- Document any new learnings
Post-Incident Review
- Schedule post-mortem meeting
- Gather timeline and evidence
- Identify action items
Appendix
Related Runbooks
- [RB-XXX: Related Runbook 1]
- [RB-YYY: Related Runbook 2]
Reference Documentation
- [Link to architecture docs]
- [Link to service docs]
Revision History
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | [Date] | [Name] | Initial version |
| 1.1 | [Date] | [Name] | [Changes] |
Incident Response Runbook Template
# Incident Runbook: [Alert Name]
| Property | Value |
|----------|-------|
| **Alert** | [Alert Name/ID] |
| **Severity** | [SEV1/SEV2/SEV3/SEV4] |
| **Service** | [Service Name] |
| **SLO Impact** | [Which SLO is affected] |
---
## Alert Details
**Trigger Condition:**
```text
[Alert query/condition]
Example: error_rate > 1% for 5 minutes
```
**Alert Meaning:** [What this alert indicates]
**False Positive Indicators:** [Signs this might be a false alarm]
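The example trigger condition can also be evaluated by hand from raw counts while triaging. The following is a minimal sketch, assuming a hypothetical helper named `error_rate_exceeds` and whole-number request and error counts collected over the alert window; it is not part of any monitoring product, and the "for 5 minutes" part of the condition is left to the alerting system.

```bash
# Hypothetical helper: succeeds (exit 0) when errors/total is strictly above
# the given whole-number percentage threshold. Integer arithmetic only.
# Usage: error_rate_exceeds <errors> <total_requests> <threshold_percent>
error_rate_exceeds() {
  local errors=$1 total=$2 threshold_pct=$3
  # No traffic means no measurable error rate; treat as not exceeding.
  [ "$total" -gt 0 ] || return 1
  # errors/total > threshold/100  is equivalent to  errors*100 > threshold*total
  [ $((errors * 100)) -gt $((threshold_pct * total)) ]
}
```

For instance, `error_rate_exceeds 20 1000 1` succeeds because 2% is above the 1% threshold, while `error_rate_exceeds 10 1000 1` does not because exactly 1% is not above it.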
Immediate Actions (First 5 Minutes)
1. Acknowledge Alert
# Acknowledge in PagerDuty
pd incident:acknowledge
# Or via Slack
/pd ack
2. Assess Impact
Quick Health Checks:
# Check service status
curl -s https://api.example.com/health | jq .
# Check error rate
kubectl logs -l app=service --tail=100 | grep -c ERROR
# Check pod status
kubectl get pods -n production -l app=service
Impact Assessment:
| Check | Command | Expected | Actual |
|---|---|---|---|
| Health endpoint | `curl /health` | 200 OK | [Result] |
| Error rate | `grep ERROR` | < 10 | [Result] |
| Pod status | `kubectl get pods` | Running | [Result] |
3. Initial Communication
Post in #incidents:
🔴 INCIDENT: [Service] - [Brief Description]
Severity: [SEV level]
Impact: [User impact]
Status: Investigating
Lead: @[your-name]
Diagnosis
Common Causes and Checks
Cause 1: High Traffic
# Check pod CPU and memory usage (a rough proxy for traffic load)
kubectl top pods -n production -l app=service
# Check HPA status
kubectl get hpa -n production
If traffic spike confirmed:
- Scale replicas: `kubectl scale deployment/service --replicas=10`
- Enable rate limiting if available
Cause 2: Database Issues
# Check database connections
kubectl exec -it [pod] -- psql -c "SELECT count(*) FROM pg_stat_activity;"
# Check slow queries
kubectl logs -l app=service | grep "slow query"
If database issues:
- Check connection pool exhaustion
- Look for long-running queries
- Consider read replica failover
Cause 3: Dependency Failure
# Check external dependencies
curl -s https://status.dependency.com/api/v2/status.json | jq .
# Check circuit breaker status
kubectl logs -l app=service | grep "circuit"
If dependency failure:
- Verify external service status
- Check for timeout configuration
- Consider enabling fallback behavior
Resolution Steps
Quick Fixes
| Issue | Quick Fix | Command |
|---|---|---|
| Pod crash loop | Restart deployment | kubectl rollout restart deployment/service |
| Memory pressure | Increase limits | kubectl edit deployment/service |
| Config error | Rollback config | kubectl rollout undo deployment/service |
Rollback Procedure
# List recent deployments
kubectl rollout history deployment/service -n production
# Rollback to previous version
kubectl rollout undo deployment/service -n production
# Rollback to specific revision
kubectl rollout undo deployment/service -n production --to-revision=2
Resolution Verification
Verification Checklist:
- Alert has cleared
- Health checks passing
- Error rate below threshold
- No user complaints in support channels
- Metrics returning to baseline
Monitoring Period: Monitor for 15 minutes after resolution
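The monitoring period can be scripted rather than watched by hand. The sketch below assumes a hypothetical helper named `monitor_checks` that repeats any check command a fixed number of times; the attempt count and interval are illustrative choices, not values prescribed by this runbook.

```bash
# Hypothetical helper: run a check command repeatedly, sleeping between
# attempts. Stops at the first failed check so a regression is noticed
# immediately instead of at the end of the monitoring period.
# Usage: monitor_checks <attempts> <interval_seconds> <command...>
monitor_checks() {
  local attempts=$1 interval=$2
  shift 2
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if ! "$@"; then
      echo "check failed on attempt $i of $attempts" >&2
      return 1
    fi
    # Skip the sleep after the final attempt.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$interval"
    fi
    i=$((i + 1))
  done
  echo "all $attempts checks passed"
}
```

As one possible invocation, `monitor_checks 30 30 curl -fsS -o /dev/null https://api.example.com/health` polls the health endpoint every 30 seconds for roughly 15 minutes and exits non-zero as soon as one request fails.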
Closure
Update Status
✅ RESOLVED: [Service] - [Brief Description]
Duration: [X] minutes
Root Cause: [Brief cause]
Resolution: [What fixed it]
Follow-up: [Any action items]
Post-Incident Tasks
- Update incident timeline
- Create post-mortem doc if SEV1/SEV2
- File tickets for follow-up work
- Update runbook if needed
Database Failover Runbook
# Runbook: Database Failover
| Property | Value |
|----------|-------|
| **ID** | RB-DB-001 |
| **Category** | Emergency |
| **Service** | PostgreSQL Primary |
| **Owner** | Platform Team |
| **Last Tested** | 2025-01-15 |
---
## Overview
**Purpose:** Failover from primary database to replica when primary is unavailable.
**When to Use:**
- Primary database unresponsive for > 5 minutes
- Primary database corruption detected
- Planned maintenance requiring failover
**Expected Outcome:** Application traffic routed to new primary
**Estimated Duration:** 15-30 minutes
---
## Prerequisites
### Required Access
- [ ] Azure Portal - Contributor on resource group
- [ ] kubectl - cluster-admin
- [ ] Database credentials - postgres superuser
### Pre-Failover Checks
```bash
# Verify replica is healthy and caught up
az postgres flexible-server replica list --resource-group rg-prod --name pg-primary
# Check replication lag
psql -h pg-replica.postgres.database.azure.com -U postgres -c \
"SELECT pg_last_wal_receive_lsn() - pg_last_wal_replay_lsn() AS lag_bytes;"
```
Acceptable lag: < 1 MB
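The lag threshold can be turned into a go/no-go gate before promoting the replica. This is a minimal sketch, assuming a hypothetical helper named `lag_within_limit` and interpreting 1 MB as 1048576 bytes; adjust the limit if your team defines it differently.

```bash
# Hypothetical helper: succeeds (exit 0) when replication lag in bytes is
# strictly below the 1 MB limit (taken here as 1024 * 1024 bytes).
# Usage: lag_within_limit <lag_bytes>
lag_within_limit() {
  local lag_bytes=$1
  local limit_bytes=$((1024 * 1024))
  [ "$lag_bytes" -lt "$limit_bytes" ]
}
```

The byte count could come from the lag query run with `psql -At`, which prints only the bare value. An empty result (for example, when the server is not a replica) should be treated as a failed check and handled before calling the helper.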
Failover Procedure
Step 1: Confirm Primary is Unavailable
# Test primary connectivity
psql -h pg-primary.postgres.database.azure.com -U postgres -c "SELECT 1;"
# Check Azure status
az postgres flexible-server show --resource-group rg-prod --name pg-primary --query "state"
Expected: Connection timeout or error state
Step 2: Notify Stakeholders
🔴 DATABASE FAILOVER INITIATED
Target: pg-primary → pg-replica
Reason: [Primary unavailable/Maintenance/etc.]
Expected Downtime: 5-10 minutes
Step 3: Promote Replica
# Promote replica to primary (Azure Flexible Server)
az postgres flexible-server replica stop-replication \
--resource-group rg-prod \
--name pg-replica
# Verify promotion
az postgres flexible-server show \
--resource-group rg-prod \
--name pg-replica \
--query "replicationRole"
Expected: replicationRole: None (standalone)
Step 4: Update Connection Strings
# Update Kubernetes secret
kubectl create secret generic db-connection \
--from-literal=host=pg-replica.postgres.database.azure.com \
-n production \
--dry-run=client -o yaml | kubectl apply -f -
# Restart applications to pick up new connection
kubectl rollout restart deployment -l uses-database=true -n production
Step 5: Verify Application Connectivity
# Check application logs
kubectl logs -l app=api-service --tail=50 | grep -i database
# Test application health
curl -s https://api.example.com/health | jq .database
Post-Failover
Immediate Tasks
- Verify all applications connected to new primary
- Check for data consistency
- Monitor error rates
Recovery Tasks (Next 24 Hours)
- Investigate original primary failure
- Create new replica from new primary
- Update DNS/connection strings permanently
- Document incident and learnings
Rollback
If failover causes issues:
# If original primary is recoverable
# Stop writes to new primary
kubectl scale deployment --replicas=0 -l uses-database=true -n production
# Restart the original primary (assumes it was stopped rather than lost)
az postgres flexible-server start --resource-group rg-prod --name pg-primary
# Revert connection strings
kubectl create secret generic db-connection \
--from-literal=host=pg-primary.postgres.database.azure.com \
-n production \
--dry-run=client -o yaml | kubectl apply -f -
# Restart applications
kubectl rollout restart deployment -l uses-database=true -n production
Runbook Quality Checklist
| Criterion | Description | Check |
|---|---|---|
| Actionable | Every step has a specific action | [ ] |
| Testable | Can be practiced in non-prod | [ ] |
| Current | Reflects current system state | [ ] |
| Complete | Covers happy and error paths | [ ] |
| Accessible | Available during incidents | [ ] |
| Versioned | Changes tracked with dates | [ ] |
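
Parts of this checklist, such as "Current" and "Versioned", can be partially checked mechanically. The sketch below assumes a hypothetical helper named `check_runbook_fields` and a runbook file whose metadata table follows the Standard Runbook Template; it only confirms that the fields are present, not that their values are accurate.

```bash
# Hypothetical lint: report which metadata fields from the Standard Runbook
# Template are missing from a runbook file.
# Prints one line per missing field and returns non-zero if any are missing.
# Usage: check_runbook_fields <runbook-file>
check_runbook_fields() {
  local file=$1 missing=0 field
  for field in "ID" "Owner" "Last Updated" "Last Tested"; do
    # Template rows look like: | **Owner** | Platform Team |
    if ! grep -qF "| **${field}** |" "$file"; then
      echo "missing field: ${field}"
      missing=1
    fi
  done
  return "$missing"
}
```

A check like this could run in CI over the runbook collection so that a runbook without a "Last Tested" date is flagged before an incident rather than during one.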
Workflow
When creating runbooks:
- Identify Need: What operation/incident needs documentation?
- Gather Information: Interview operators, review past incidents
- Draft Runbook: Use appropriate template
- Validate Steps: Walk through with subject matter expert
- Test in Non-Prod: Execute runbook in staging
- Publish: Add to runbook collection
- Train Team: Ensure operators know where to find it
- Maintain: Review and update regularly
Last Updated: 2025-12-26