Claude Code Plugins

Community-maintained marketplace

Feedback

Author operational runbooks for incident response and troubleshooting

Install Skill

1Download skill
2Enable skills in Claude

Open claude.ai/settings/capabilities and find the "Skills" section

3Upload to Claude

Click "Upload skill" and select the downloaded ZIP file

Note: Please verify skill by going through its instructions before using it.

SKILL.md

name runbook-authoring
description Author operational runbooks for incident response and troubleshooting
allowed-tools Read, Glob, Grep, Write, Edit

Runbook Authoring Skill

When to Use This Skill

Use this skill when:

  • Runbook Authoring tasks - Working on author operational runbooks for incident response and troubleshooting
  • Planning or design - Need guidance on Runbook Authoring approaches
  • Best practices - Want to follow established patterns and standards

Overview

Create operational runbooks for incident response and troubleshooting.

MANDATORY: Documentation-First Approach

Before authoring runbooks:

  1. Invoke docs-management skill for runbook patterns
  2. Verify SRE practices via MCP servers (perplexity)
  3. Base guidance on operational best practices

Runbook Purpose

RUNBOOK GOALS:

┌─────────────────────────────────────────────────────────────────┐
│                    WHY RUNBOOKS?                                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  REDUCE MTTR (Mean Time To Recovery)                             │
│  ├── Pre-documented steps save diagnosis time                    │
│  ├── No need to figure it out during incident                    │
│  └── Consistent approach every time                              │
│                                                                  │
│  ENABLE ANYONE TO RESPOND                                        │
│  ├── On-call doesn't need to be expert                           │
│  ├── Knowledge transfer from senior to junior                    │
│  ├── Reduces bus factor                                          │
│  └── New team members can respond effectively                    │
│                                                                  │
│  DOCUMENT TRIBAL KNOWLEDGE                                       │
│  ├── Capture what experts know                                   │
│  ├── Make implicit knowledge explicit                            │
│  └── Preserve knowledge when people leave                        │
│                                                                  │
│  IMPROVE OVER TIME                                               │
│  ├── Each incident improves the runbook                          │
│  ├── Capture new failure modes                                   │
│  └── Evolve with the system                                      │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Runbook Types

RUNBOOK CATEGORIES:

ALERT RUNBOOKS:
┌─────────────────────────────────────────────────────────────────┐
│ One runbook per alert                                            │
│                                                                  │
│ Purpose: What to do when this specific alert fires               │
│ Linked from: Alert annotations                                   │
│ Contains: Triage, diagnosis, remediation for this alert          │
└─────────────────────────────────────────────────────────────────┘

SYMPTOM RUNBOOKS:
┌─────────────────────────────────────────────────────────────────┐
│ One runbook per symptom/problem                                  │
│                                                                  │
│ Purpose: How to diagnose/fix a type of problem                   │
│ Examples: "High Latency", "Out of Memory", "Connection Errors"   │
│ Contains: Decision tree for multiple potential causes            │
└─────────────────────────────────────────────────────────────────┘

PROCEDURE RUNBOOKS:
┌─────────────────────────────────────────────────────────────────┐
│ One runbook per operational procedure                            │
│                                                                  │
│ Purpose: How to perform maintenance tasks                        │
│ Examples: "Database Failover", "Certificate Rotation"            │
│ Contains: Step-by-step procedures                                │
└─────────────────────────────────────────────────────────────────┘

SERVICE RUNBOOKS:
┌─────────────────────────────────────────────────────────────────┐
│ One runbook per service                                          │
│                                                                  │
│ Purpose: Service overview and common operations                  │
│ Contains: Architecture, dependencies, common issues              │
└─────────────────────────────────────────────────────────────────┘

Runbook Structure

RUNBOOK ANATOMY:

┌─────────────────────────────────────────────────────────────────┐
│ HEADER                                                           │
│ ├── Title, service, last updated                                 │
│ ├── Alert link (if alert runbook)                                │
│ └── Quick summary                                                │
├─────────────────────────────────────────────────────────────────┤
│ OVERVIEW                                                         │
│ ├── What this runbook covers                                     │
│ ├── When to use it                                               │
│ └── Expected outcome                                             │
├─────────────────────────────────────────────────────────────────┤
│ QUICK ACTIONS (Optimize for speed)                               │
│ ├── 2-3 most common fixes                                        │
│ ├── Copy-paste commands                                          │
│ └── "Try these first" section                                    │
├─────────────────────────────────────────────────────────────────┤
│ DIAGNOSIS                                                        │
│ ├── How to verify the problem                                    │
│ ├── Key metrics/logs to check                                    │
│ ├── Decision tree for root cause                                 │
│ └── Common causes and indicators                                 │
├─────────────────────────────────────────────────────────────────┤
│ REMEDIATION                                                      │
│ ├── Step-by-step fix for each cause                              │
│ ├── Rollback procedures                                          │
│ └── Verification steps                                           │
├─────────────────────────────────────────────────────────────────┤
│ ESCALATION                                                       │
│ ├── When to escalate                                             │
│ ├── Who to contact                                               │
│ └── What information to provide                                  │
├─────────────────────────────────────────────────────────────────┤
│ REFERENCES                                                       │
│ ├── Related dashboards                                           │
│ ├── Architecture docs                                            │
│ └── Related runbooks                                             │
└─────────────────────────────────────────────────────────────────┘

Alert Runbook Template

# Runbook: {Alert Name}

**Service:** {Service Name}
**Alert:** {Alert Name}
**Severity:** {P1/P2/P3/P4}
**Owner:** {Team/Person}

---

## Overview

**What this alert means:**
{One sentence explanation of what triggered this alert}

**User impact:**
{How users are affected when this fires}

**Expected resolution time:**
{Typical time to resolve}

---

## Quick Actions

> **Try these first before deep diagnosis**

### 1. Check if it's a known issue

```bash
# Check recent incidents
open https://status.example.com

# Check deployment history
kubectl rollout history deployment/orders-api -n production

2. Quick restart (if safe)

# Rolling restart (no downtime)
kubectl rollout restart deployment/orders-api -n production

# Wait for rollout
kubectl rollout status deployment/orders-api -n production

3. Rollback recent deployment

# Rollback to previous version
kubectl rollout undo deployment/orders-api -n production

Diagnosis

Step 1: Verify the alert is real

# Check current metric value
curl -s 'http://prometheus:9090/api/v1/query?query=rate(http_requests_total{status=~"5.."}[5m])' | jq

Dashboard: Grafana - Orders API Overview

Step 2: Identify the scope

Check Command/Action
All pods affected? kubectl get pods -n production -l app=orders-api
All endpoints? Check per-endpoint error rate in Grafana
Started when? Check Grafana annotations for deployments

Step 3: Determine root cause

DECISION TREE:

Is it all pods or specific pods?
├── All pods → Likely code/config issue
│   ├── Recent deployment? → Rollback
│   └── No deployment? → Check dependencies
│
└── Specific pods → Likely infrastructure issue
    ├── Same node? → Node issue
    └── Random? → Check pod logs

Common Causes

Cause Indicators Solution
Bad deployment Errors started after deploy Rollback
Database issues Connection errors in logs Check DB
Memory pressure OOMKilled pods Scale up or fix leak
Dependency down Timeout errors Check dependency status

Remediation

Cause 1: Bad Deployment

Indicators:

  • Errors started immediately after deployment
  • New error types appearing

Fix:

# 1. Rollback to previous version
kubectl rollout undo deployment/orders-api -n production

# 2. Verify rollback
kubectl rollout status deployment/orders-api -n production

# 3. Check error rate dropping
watch -n5 'curl -s "http://prometheus:9090/..." | jq'

Verification:

  • Error rate returning to normal
  • No new 5xx errors in logs
  • Alert auto-resolves

Cause 2: Database Connection Issues

Indicators:

  • "Connection refused" or "Connection timeout" in logs
  • Database metrics showing high connections

Fix:

# 1. Check database status
psql -h db.example.com -U admin -c "SELECT count(*) FROM pg_stat_activity"

# 2. If connection pool exhausted, restart pods
kubectl rollout restart deployment/orders-api -n production

# 3. If database is down, escalate to DBA

Cause 3: Memory Pressure

Indicators:

  • OOMKilled in pod events
  • Memory usage climbing before crash

Fix:

# 1. Check for OOMKilled
kubectl get events -n production --field-selector reason=OOMKilled

# 2. Increase memory limit temporarily
kubectl set resources deployment/orders-api -n production \
  --limits=memory=2Gi

# 3. File ticket for memory leak investigation

Escalation

When to escalate

  • Quick actions didn't resolve
  • Root cause unclear after 15 minutes
  • Data loss or corruption suspected
  • Multiple services affected

Who to contact

Situation Contact Method
Database issues DBA on-call PagerDuty: dba-oncall
Network issues Platform team Slack: #platform-oncall
Security concern Security team PagerDuty: security
Unknown Engineering Manager Phone: XXX-XXX-XXXX

Information to provide

When escalating, include:

  • Alert name and time started
  • Actions already taken
  • Current hypothesis
  • Link to incident channel

References


Revision History

Date Change Author
2024-01-15 Added memory pressure section @engineer
2024-01-01 Initial version @oncall

Procedure Runbook Template

# Procedure: {Procedure Name}

**Purpose:** {What this procedure accomplishes}
**When to use:** {Circumstances requiring this procedure}
**Duration:** {Expected time to complete}
**Risk Level:** {Low/Medium/High}

---

## Prerequisites

- [ ] Access to production Kubernetes cluster
- [ ] Database admin credentials
- [ ] Approval from {approver} if during business hours

## Pre-Flight Checks

```bash
# Verify you have correct access
kubectl auth can-i '*' '*' -n production

# Check current system state
kubectl get pods -n production

# Confirm no ongoing incidents
open https://status.example.com

Procedure Steps

Step 1: {First Step Title}

Purpose: {Why this step}

# Commands to execute
{command}

Expected output:

{expected output}

Verification:

  • {Check to confirm step succeeded}

Step 2: {Second Step Title}

Purpose: {Why this step}

{command}

Expected output:

{expected output}

Verification:

  • {Check to confirm step succeeded}

Step 3: {Third Step Title}

...


Rollback Procedure

If something goes wrong, follow these steps to restore previous state:

Step 1: {Rollback Step}

{rollback command}

Step 2: {Verify Rollback}

{verification command}

Post-Procedure Verification

  • Service is healthy (check dashboard)
  • No new errors in logs
  • Dependent services unaffected
  • Monitoring shows expected state

Cleanup

# Remove temporary resources
{cleanup commands}

Troubleshooting

Issue: {Common Issue 1}

Symptom: {What you see} Cause: {Why it happens} Solution: {How to fix}

Issue: {Common Issue 2}

...


Template References

  • {Related documentation}
  • {Architecture diagrams}

Runbook Quality Checklist

# Runbook Quality Checklist

## Structure
- [ ] Clear title and service identification
- [ ] Last updated date is recent
- [ ] Owner/team identified
- [ ] Severity/risk level stated

## Quick Actions
- [ ] 2-3 most common fixes at the top
- [ ] Commands are copy-paste ready
- [ ] No unnecessary explanation before actions

## Diagnosis
- [ ] Decision tree for root cause
- [ ] Key metrics/logs to check
- [ ] Links to dashboards
- [ ] Common causes listed

## Remediation
- [ ] Step-by-step for each cause
- [ ] Verification after each fix
- [ ] Rollback procedures included
- [ ] Commands are tested and work

## Escalation
- [ ] Clear escalation criteria
- [ ] Contact information current
- [ ] Information to provide when escalating

## Usability
- [ ] Can be followed at 3am under stress
- [ ] No jargon without explanation
- [ ] Links are not broken
- [ ] Screenshots where helpful

Workflow

When authoring runbooks:

  1. Start from Incidents: Create runbooks after incidents
  2. Optimize for Speed: Quick actions first
  3. Include Commands: Copy-paste ready
  4. Add Decision Trees: Help diagnose root cause
  5. Define Escalation: When and who
  6. Test Regularly: Verify runbooks work
  7. Update After Incidents: Improve based on learnings

Further Reading

For detailed guidance:


Last Updated: 2025-12-26