SKILL.md

name: runbook-creation
description: Operational runbook templates for incident response and procedures
allowed-tools: Read, Glob, Grep, Write, Edit

Runbook Creation Skill

When to Use This Skill

Use this skill when:

  • Runbook creation tasks - You are writing operational runbooks for incident response or routine procedures
  • Planning or design - You need guidance on how to structure or scope a runbook
  • Best practices - You want to follow established SRE patterns and standards

Overview

Create operational runbooks for incident response, maintenance procedures, and operational tasks.

MANDATORY: Documentation-First Approach

Before creating runbooks:

  1. Invoke docs-management skill for runbook patterns
  2. Verify SRE best practices via MCP servers (perplexity)
  3. Base guidance on Google SRE principles

Runbook Types

Runbook Categories:

┌─────────────────────────────────────────────────────────────────────────────┐
│  Incident Response Runbooks                                                  │
│  • Alert-triggered procedures                                                │
│  • Escalation paths                                                          │
│  • Communication templates                                                   │
├─────────────────────────────────────────────────────────────────────────────┤
│  Operational Runbooks                                                        │
│  • Deployment procedures                                                     │
│  • Maintenance tasks                                                         │
│  • Backup/restore operations                                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│  Troubleshooting Runbooks                                                    │
│  • Diagnostic procedures                                                     │
│  • Common issue resolution                                                   │
│  • Debug workflows                                                           │
├─────────────────────────────────────────────────────────────────────────────┤
│  Emergency Runbooks                                                          │
│  • Disaster recovery                                                         │
│  • Security incident response                                                │
│  • Business continuity                                                       │
└─────────────────────────────────────────────────────────────────────────────┘

Standard Runbook Template

# Runbook: [TITLE]

| Property | Value |
|----------|-------|
| **ID** | RB-[NUMBER] |
| **Category** | [Incident/Operational/Troubleshooting/Emergency] |
| **Service** | [Service Name] |
| **Owner** | [Team/Individual] |
| **Last Updated** | [YYYY-MM-DD] |
| **Last Tested** | [YYYY-MM-DD] |
| **Review Frequency** | [Quarterly/Monthly/Annually] |

---

## Overview

**Purpose:** [What this runbook helps you accomplish]

**When to Use:** [Conditions that trigger this runbook]

**Expected Outcome:** [What success looks like]

**Estimated Duration:** [Time to complete]

---

## Prerequisites

### Required Access

- [ ] [System/Tool 1] - [Role/Permission needed]
- [ ] [System/Tool 2] - [Role/Permission needed]

### Required Knowledge

- [Skill/Knowledge 1]
- [Skill/Knowledge 2]

### Tools Needed

| Tool | Purpose | Access URL |
|------|---------|------------|
| [Tool 1] | [Purpose] | [URL/Link] |
| [Tool 2] | [Purpose] | [URL/Link] |

---

## Quick Reference

```text
Quick Commands:
┌────────────────────────────────────────────────────────────────┐
│ Check service status: kubectl get pods -n [namespace]          │
│ View logs: kubectl logs -f [pod-name] -n [namespace]           │
│ Restart service: kubectl rollout restart deployment/[name]     │
│ Check metrics: [monitoring-url]                                │
└────────────────────────────────────────────────────────────────┘
```

Procedure

Step 1: [Step Name]

Objective: [What this step accomplishes]

Actions:

  1. [Action 1]

    # Command example
    kubectl get pods -n production
    
  2. [Action 2]

Expected Result: [What you should see]

If This Fails: Go to Troubleshooting Section


Step 2: [Step Name]

Objective: [What this step accomplishes]

Actions:

  1. [Action 1]
  2. [Action 2]

Decision Point:

┌─────────────────────────────────────┐
│ Is the service responding?          │
│                                     │
│ YES → Continue to Step 3            │
│ NO  → Go to the Escalation section  │
└─────────────────────────────────────┘

Step 3: [Verification]

Objective: Verify the issue is resolved

Verification Checklist:

  • Service is responding to health checks
  • Metrics show normal values
  • No new errors in logs
  • Users can access the service

Troubleshooting

Issue: [Common Issue 1]

Symptoms: [What you observe]

Cause: [Root cause]

Resolution:

  1. [Step 1]
  2. [Step 2]

Issue: [Common Issue 2]

Symptoms: [What you observe]

Cause: [Root cause]

Resolution:

  1. [Step 1]
  2. [Step 2]

Escalation

When to Escalate

  • Issue not resolved after [X] minutes
  • Impact affects [threshold]
  • Required access not available
  • Unsure of next steps

Escalation Path

| Level | Contact | Method | Response Time |
|-------|---------|--------|---------------|
| L1 | On-call Engineer | PagerDuty | 15 min |
| L2 | Team Lead | Slack #incidents | 30 min |
| L3 | Engineering Manager | Phone | 1 hour |
| L4 | VP Engineering | Phone | As needed |
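
One way to read the response times in the escalation path is as a budget: if a level has not responded within its window, the incident moves up one level. The sketch below encodes that reading in shell. The function names `response_budget` and `next_level` are hypothetical, and treating the response time as an escalation trigger is an assumption rather than something the template states.

```bash
#!/usr/bin/env bash
# Hypothetical helpers; the minute values come from the escalation path table.

# Prints the response-time budget (minutes) for a level.
# Returns non-zero for L4, which is "as needed" and has no fixed budget.
response_budget() {
  case $1 in
    L1) echo 15 ;;
    L2) echo 30 ;;
    L3) echo 60 ;;
    *)  return 1 ;;
  esac
}

# Prints the level that should own the incident after $2 minutes
# without a response from level $1.
next_level() {
  local current=$1 waited_min=$2 budget
  budget=$(response_budget "$current") || { echo "$current"; return 0; }
  if [ "$waited_min" -ge "$budget" ]; then
    echo "L$(( ${current#L} + 1 ))"
  else
    echo "$current"
  fi
}
```

For example, `next_level L1 20` prints `L2`, while `next_level L4 500` stays at `L4` because there is no level above it.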

Communication

Status Updates

Template:

[TIMESTAMP] - [SERVICE] - [STATUS]

Current Status: [Investigating/Identified/Monitoring/Resolved]
Impact: [Description of user impact]
Next Update: [Time of next update]

Actions Taken:
- [Action 1]
- [Action 2]

Next Steps:
- [Planned action]
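
Filling in the status update by hand is error-prone under pressure, so a small formatter can help. The following is a minimal sketch, assuming a hypothetical function `status_update` and a UTC timestamp format of my choosing; it covers the header, status, impact, next update, and actions taken, and leaves "Next Steps" to be added by hand.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints a status update in the template's layout.
# Usage: status_update SERVICE STATUS IMPACT NEXT_UPDATE [ACTION...]
status_update() {
  local service=$1 status=$2 impact=$3 next_update=$4
  shift 4
  # Header line: [TIMESTAMP] - [SERVICE] - [STATUS]
  printf '%s - %s - %s\n\n' "$(date -u +%Y-%m-%dT%H:%MZ)" "$service" "$status"
  printf 'Current Status: %s\n' "$status"
  printf 'Impact: %s\n' "$impact"
  printf 'Next Update: %s\n\n' "$next_update"
  printf 'Actions Taken:\n'
  # One bullet per remaining argument; skipped when no actions were passed.
  if [ "$#" -gt 0 ]; then
    printf -- '- %s\n' "$@"
  fi
}
```

A call such as `status_update api Investigating "Elevated 5xx responses" "14:30 UTC" "Restarted pods"` produces text that can be pasted into the incident channel.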

Stakeholder Notification

| Stakeholder | When to Notify | Method |
|-------------|----------------|--------|
| Engineering | Immediately | Slack |
| Product | If user-impacting | Slack |
| Support | If customer-facing | Email |
| Leadership | If SEV1/SEV2 | Phone |

Post-Incident

Cleanup Tasks

  • Remove any temporary fixes
  • Update monitoring/alerts if needed
  • Document any new learnings

Post-Incident Review

  • Schedule post-mortem meeting
  • Gather timeline and evidence
  • Identify action items

Appendix

Related Runbooks

  • [RB-XXX: Related Runbook 1]
  • [RB-YYY: Related Runbook 2]

Reference Documentation

  • [Link to architecture docs]
  • [Link to service docs]

Revision History

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0 | [Date] | [Name] | Initial version |
| 1.1 | [Date] | [Name] | [Changes] |

Incident Response Runbook Template

# Incident Runbook: [Alert Name]

| Property | Value |
|----------|-------|
| **Alert** | [Alert Name/ID] |
| **Severity** | [SEV1/SEV2/SEV3/SEV4] |
| **Service** | [Service Name] |
| **SLO Impact** | [Which SLO is affected] |

---

## Alert Details

**Trigger Condition:**
```text
[Alert query/condition]
Example: error_rate > 1% for 5 minutes
```

Alert Meaning: [What this alert indicates]

False Positive Indicators: [Signs this might be a false alarm]
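
The example condition `error_rate > 1% for 5 minutes` combines a threshold with a duration: the alert should fire only when every sample in the window breaches the threshold. The sketch below shows that logic on plain text input. The function name `alert_firing` and the one-line-per-minute input format (`errors total`) are assumptions for illustration; a real alert would normally be evaluated by the monitoring system.

```bash
#!/usr/bin/env bash
# Hypothetical helper. Reads one "errors total" pair per line on stdin
# (one line per minute, oldest first). Succeeds only if the most recent
# WINDOW samples all have an error rate above PCT percent.
# Usage: alert_firing PCT WINDOW < samples.txt
alert_firing() {
  local pct=$1 window=$2
  awk -v pct="$pct" -v window="$window" '
    # Skip malformed lines and samples with no traffic.
    NF == 2 && $2 > 0 { n++; breach[n] = ($1 * 100 > pct * $2) }
    END {
      if (n < window) exit 1            # not enough data to decide
      for (i = n - window + 1; i <= n; i++)
        if (!breach[i]) exit 1          # one healthy minute resets the alert
      exit 0
    }'
}
```

Requiring every sample in the window to breach is what keeps a single bad minute from paging anyone, which is also a useful check when judging whether an alert is a false positive.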


Immediate Actions (First 5 Minutes)

1. Acknowledge Alert

# Acknowledge in PagerDuty
pd incident:acknowledge

# Or via Slack
/pd ack

2. Assess Impact

Quick Health Checks:

# Check service status
curl -s https://api.example.com/health | jq .

# Check error rate
kubectl logs -n production -l app=service --tail=100 | grep -c ERROR

# Check pod status
kubectl get pods -n production -l app=service

Impact Assessment:

| Check | Command | Expected | Actual |
|-------|---------|----------|--------|
| Health endpoint | curl /health | 200 OK | [Result] |
| Error rate | grep ERROR | < 10 | [Result] |
| Pod status | kubectl get pods | Running | [Result] |
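
The error-rate row compares a count of ERROR lines against an expected value of fewer than 10. A minimal sketch of that comparison follows, assuming a hypothetical function `assess_errors` that reads log lines from standard input; the OK/DEGRADED labels are illustrative and not part of the template.

```bash
#!/usr/bin/env bash
# Hypothetical helper. Counts stdin lines containing ERROR and compares the
# count against a limit (default 10, matching the "< 10" expectation).
# Usage: some-log-command | assess_errors [LIMIT]
assess_errors() {
  local limit=${1:-10} count
  # grep -c still prints 0 when nothing matches but exits 1, hence "|| true".
  count=$(grep -c ERROR || true)
  if [ "$count" -lt "$limit" ]; then
    echo "OK ($count errors)"
  else
    echo "DEGRADED ($count errors)"
    return 1
  fi
}
```

It can be fed from the log command used in the health checks, for example `kubectl logs -n production -l app=service --tail=100 | assess_errors`.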

3. Initial Communication

Post in #incidents:

🔴 INCIDENT: [Service] - [Brief Description]
Severity: [SEV level]
Impact: [User impact]
Status: Investigating
Lead: @[your-name]

Diagnosis

Common Causes and Checks

Cause 1: High Traffic

# Check request rate
kubectl top pods -n production -l app=service

# Check HPA status
kubectl get hpa -n production

If traffic spike confirmed:

  • Scale replicas: kubectl scale deployment/service -n production --replicas=10
  • Enable rate limiting if available

Cause 2: Database Issues

# Check database connections
kubectl exec -it [pod] -- psql -c "SELECT count(*) FROM pg_stat_activity;"

# Check slow queries
kubectl logs -n production -l app=service | grep "slow query"

If database issues:

  • Check connection pool exhaustion
  • Look for long-running queries
  • Consider read replica failover

Cause 3: Dependency Failure

# Check external dependencies
curl -s https://status.dependency.com/api/v2/status.json | jq .

# Check circuit breaker status
kubectl logs -n production -l app=service | grep "circuit"

If dependency failure:

  • Verify external service status
  • Check for timeout configuration
  • Consider enabling fallback behavior

Resolution Steps

Quick Fixes

| Issue | Quick Fix | Command |
|-------|-----------|---------|
| Pod crash loop | Restart deployment | kubectl rollout restart deployment/service |
| Memory pressure | Increase limits | kubectl edit deployment/service |
| Config error | Rollback config | kubectl rollout undo deployment/service |

Rollback Procedure

# List recent deployments
kubectl rollout history deployment/service -n production

# Rollback to previous version
kubectl rollout undo deployment/service -n production

# Rollback to specific revision
kubectl rollout undo deployment/service -n production --to-revision=2

Resolution Verification

Verification Checklist:

  • Alert has cleared
  • Health checks passing
  • Error rate below threshold
  • No user complaints in support channels
  • Metrics returning to baseline

Monitoring Period: Monitor for 15 minutes after resolution
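
The monitoring period can be enforced with a simple polling loop instead of watching a dashboard by eye. The sketch below repeats a check command at a fixed interval and stops at the first failure. The function name `monitor_for` and the 30-second interval in the usage note are assumptions.

```bash
#!/usr/bin/env bash
# Hypothetical helper. Runs a check command every INTERVAL_S seconds for
# DURATION_S seconds. Fails as soon as the check fails once.
# Usage: monitor_for DURATION_S INTERVAL_S COMMAND [ARGS...]
monitor_for() {
  local duration_s=$1 interval_s=$2
  shift 2
  local deadline=$(( $(date +%s) + duration_s ))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    if ! "$@"; then
      echo "check failed: $*" >&2
      return 1
    fi
    sleep "$interval_s"
  done
  echo "stable for ${duration_s}s"
}
```

For the 15-minute window in this template, a call could look like `monitor_for 900 30 curl -fsS -o /dev/null https://api.example.com/health`, where `curl -f` turns an HTTP error status into a failed check.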


Closure

Update Status

✅ RESOLVED: [Service] - [Brief Description]
Duration: [X] minutes
Root Cause: [Brief cause]
Resolution: [What fixed it]
Follow-up: [Any action items]

Post-Incident Tasks

  • Update incident timeline
  • Create post-mortem doc if SEV1/SEV2
  • File tickets for follow-up work
  • Update runbook if needed

Database Failover Runbook

# Runbook: Database Failover

| Property | Value |
|----------|-------|
| **ID** | RB-DB-001 |
| **Category** | Emergency |
| **Service** | PostgreSQL Primary |
| **Owner** | Platform Team |
| **Last Tested** | 2025-01-15 |

---

## Overview

**Purpose:** Failover from primary database to replica when primary is unavailable.

**When to Use:**
- Primary database unresponsive for > 5 minutes
- Primary database corruption detected
- Planned maintenance requiring failover

**Expected Outcome:** Application traffic routed to new primary

**Estimated Duration:** 15-30 minutes

---

## Prerequisites

### Required Access

- [ ] Azure Portal - Contributor on resource group
- [ ] kubectl - cluster-admin
- [ ] Database credentials - postgres superuser

### Pre-Failover Checks

```bash
# Verify replica is healthy and caught up
az postgres flexible-server replica list --resource-group rg-prod --name pg-primary

# Check replication lag
psql -h pg-replica.postgres.database.azure.com -U postgres -c \
  "SELECT pg_last_wal_receive_lsn() - pg_last_wal_replay_lsn() AS lag_bytes;"
```

Acceptable lag: < 1MB
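
The lag threshold is easy to misjudge when reading a raw byte count, so the comparison can be scripted. This is a minimal sketch, assuming "1MB" means 1 MiB (1,048,576 bytes) and using a hypothetical function `lag_acceptable`; an empty or non-numeric value (for example a NULL result from the replica) is treated as unknown rather than as zero.

```bash
#!/usr/bin/env bash
# Assumption: "1MB" is read as 1 MiB.
MAX_LAG_BYTES=$((1024 * 1024))

# Hypothetical helper.
# Returns 0 if the lag is below the limit, 1 if it is too high,
# and 2 if the value is empty or not a whole number.
lag_acceptable() {
  local lag=$1
  case $lag in
    ''|*[!0-9]*) echo "lag unknown: '$lag'" >&2; return 2 ;;
  esac
  [ "$lag" -lt "$MAX_LAG_BYTES" ]
}

# Possible use with the lag query from the pre-failover checks
# (psql -At prints the bare value without headers or alignment):
#   lag=$(psql -At -h pg-replica.postgres.database.azure.com -U postgres \
#     -c "SELECT pg_last_wal_receive_lsn() - pg_last_wal_replay_lsn();")
#   lag_acceptable "$lag" || echo "Replica not caught up; do not fail over yet"
```

Treating an unknown value as a distinct failure keeps a broken query from being mistaken for a healthy replica.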


Failover Procedure

Step 1: Confirm Primary is Unavailable

# Test primary connectivity
psql -h pg-primary.postgres.database.azure.com -U postgres -c "SELECT 1;"

# Check Azure status
az postgres flexible-server show --resource-group rg-prod --name pg-primary --query "state"

Expected: Connection timeout or error state

Step 2: Notify Stakeholders

🔴 DATABASE FAILOVER INITIATED
Target: pg-primary → pg-replica
Reason: [Primary unavailable/Maintenance/etc.]
Expected Downtime: 5-10 minutes

Step 3: Promote Replica

# Promote replica to primary (Azure Flexible Server)
az postgres flexible-server replica stop-replication \
  --resource-group rg-prod \
  --name pg-replica

# Verify promotion
az postgres flexible-server show \
  --resource-group rg-prod \
  --name pg-replica \
  --query "replicationRole"

Expected: replicationRole: None (standalone)

Step 4: Update Connection Strings

# Update Kubernetes secret
kubectl create secret generic db-connection \
  --namespace production \
  --from-literal=host=pg-replica.postgres.database.azure.com \
  --dry-run=client -o yaml | kubectl apply -f -

# Restart applications to pick up new connection
kubectl rollout restart deployment -l uses-database=true -n production

Step 5: Verify Application Connectivity

# Check application logs
kubectl logs -n production -l app=api-service --tail=50 | grep -i database

# Test application health
curl -s https://api.example.com/health | jq .database

Post-Failover

Immediate Tasks

  • Verify all applications connected to new primary
  • Check for data consistency
  • Monitor error rates

Recovery Tasks (Next 24 Hours)

  • Investigate original primary failure
  • Create new replica from new primary
  • Update DNS/connection strings permanently
  • Document incident and learnings

Rollback

If failover causes issues:

# If original primary is recoverable
# Stop writes to new primary
kubectl scale deployment --replicas=0 -l uses-database=true -n production

# Bring the original primary back online (if it was stopped)
az postgres flexible-server start --resource-group rg-prod --name pg-primary

# Revert connection strings
kubectl create secret generic db-connection \
  --namespace production \
  --from-literal=host=pg-primary.postgres.database.azure.com \
  --dry-run=client -o yaml | kubectl apply -f -

# Restart applications
kubectl rollout restart deployment -l uses-database=true -n production

Runbook Quality Checklist

| Criterion | Description | Check |
|-----------|-------------|-------|
| Actionable | Every step has a specific action | [ ] |
| Testable | Can be practiced in non-prod | [ ] |
| Current | Reflects current system state | [ ] |
| Complete | Covers happy and error paths | [ ] |
| Accessible | Available during incidents | [ ] |
| Versioned | Changes tracked with dates | [ ] |
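
Parts of this checklist can be checked mechanically before a human review. The sketch below looks for a few markers that the standard template contains: the Owner, Last Updated, and Last Tested properties and the Overview and Prerequisites headings. The function name `lint_runbook` and the choice of markers are assumptions; criteria such as "Actionable" or "Complete" still need a person.

```bash
#!/usr/bin/env bash
# Hypothetical helper. Reports template markers missing from a runbook file.
# Returns 0 when all markers are present, 1 otherwise.
# Usage: lint_runbook path/to/runbook.md
lint_runbook() {
  local file=$1 missing=0 marker
  for marker in '**Owner**' '**Last Updated**' '**Last Tested**' \
                '## Overview' '## Prerequisites'; do
    # -F matches the marker literally, so the asterisks are not a pattern.
    if ! grep -qF -- "$marker" "$file"; then
      echo "missing: $marker"
      missing=1
    fi
  done
  return "$missing"
}
```

Run against the Database Failover Runbook in this document, this check would report a missing Last Updated property, since that example lists only Last Tested.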

Workflow

When creating runbooks:

  1. Identify Need: What operation/incident needs documentation?
  2. Gather Information: Interview operators, review past incidents
  3. Draft Runbook: Use appropriate template
  4. Validate Steps: Walk through with subject matter expert
  5. Test in Non-Prod: Execute runbook in staging
  6. Publish: Add to runbook collection
  7. Train Team: Ensure operators know where to find it
  8. Maintain: Review and update regularly
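
The final step, regular review, is the one most likely to slip, so a staleness check against the Last Tested or Last Updated date can be automated. This is a minimal sketch with hypothetical function names `days_since` and `needs_review`. It relies on the `-d` option of GNU `date`, so it will not work unchanged with BSD/macOS `date`, and mapping Quarterly to 90 days (Monthly to 30, Annually to 365) is an assumption.

```bash
#!/usr/bin/env bash
# Hypothetical helpers; require GNU date for the -d option.

# Prints whole days elapsed since a YYYY-MM-DD date (UTC).
days_since() {
  local past now
  past=$(date -u -d "$1" +%s) || return 2
  now=$(date -u +%s)
  echo $(( (now - past) / 86400 ))
}

# Succeeds if more than MAX_DAYS have passed since LAST_DATE.
# Usage: needs_review LAST_DATE MAX_DAYS
needs_review() {
  local last=$1 max_days=$2 age
  age=$(days_since "$last") || return 2
  [ "$age" -gt "$max_days" ]
}
```

For a runbook with a quarterly review frequency, `needs_review 2025-01-15 90 && echo "review overdue"` would flag it once more than 90 days have passed since that date.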


Last Updated: 2025-12-26