
SKILL.md

---
name: runbooks-troubleshooting-guides
description: Use when creating troubleshooting guides and diagnostic procedures for operational issues. Covers problem diagnosis, root cause analysis, and systematic debugging.
allowed-tools: Read, Write, Edit, Bash, Grep, Glob
---

# Runbooks - Troubleshooting Guides

Creating effective troubleshooting guides for diagnosing and resolving operational issues.

## Troubleshooting Framework

### The 5-Step Method

1. **Observe** - Gather symptoms and data
2. **Hypothesize** - Form theories about root cause
3. **Test** - Validate hypotheses with experiments
4. **Fix** - Apply solution
5. **Verify** - Confirm resolution
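
A minimal sketch of step 1 (Observe) in shell - the `production` namespace and `api-server` deployment are placeholders, not part of this skill:

```bash
# Hypothetical "Observe" pass: capture raw evidence before forming any hypothesis.
ts=$(date -u +%Y%m%dT%H%M%SZ)
mkdir -p "triage-$ts"
kubectl get pods -n production -o wide > "triage-$ts/pods.txt"
# kubectl top requires metrics-server; skip if unavailable.
kubectl top pods -n production > "triage-$ts/top.txt"
kubectl logs deployment/api-server -n production --tail=500 | grep -i error > "triage-$ts/errors.txt" || true
```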

## Basic Troubleshooting Guide

````markdown
# Troubleshooting: [Problem Statement]

## Symptoms

What the user/system is experiencing:

- API returning 503 errors
- Response time > 10 seconds
- High CPU usage alerts

## Quick Checks (< 2 minutes)

### 1. Is the service running?

```bash
kubectl get pods -n production | grep api-server
```

Expected: STATUS = Running

### 2. Are recent deploys the cause?

```bash
kubectl rollout history deployment/api-server
```

Check: Did we deploy in the last 30 minutes?

### 3. Is this affecting all users?

Check error rate in Datadog:

- If < 5%: Isolated issue, may be client-specific
- If > 50%: Widespread issue, likely infrastructure

## Common Causes

| Symptom        | Likely Cause             | Quick Fix          |
|----------------|--------------------------|--------------------|
| 503 errors     | Pod crashlooping         | Restart deployment |
| Slow responses | Database connection pool | Increase pool size |
| High memory    | Memory leak              | Restart pods       |

## Detailed Diagnosis

### Hypothesis 1: Database Connection Issues

Test:

```bash
# Check database connections
kubectl exec -it api-server-abc -- psql -h $DB_HOST -c "SELECT count(*) FROM pg_stat_activity"
```

If connections > 90: Pool is saturated. Next step: Increase pool size or investigate slow queries.

### Hypothesis 2: High Traffic Spike

Test:

```bash
# Check request rate
curl -H "Authorization: Bearer $DD_API_KEY" \
  "https://api.datadoghq.com/api/v1/query?query=sum:nginx.requests{*}"
```

If requests are 3x normal: Traffic spike. Next step: Scale up pods or enable rate limiting.

### Hypothesis 3: External Service Degradation

Test:

```bash
# Check third-party API
curl -w "@curl-format.txt" https://api.stripe.com/v1/charges
```

If response time > 2s: External service is slow. Next step: Implement a circuit breaker or increase timeouts.

## Resolution Steps

### Solution A: Immediate (< 5 minutes)

Restart affected pods:

```bash
kubectl rollout restart deployment/api-server -n production
```

When to use: Quick mitigation while investigating the root cause.

### Solution B: Short-term (< 30 minutes)

Scale up resources:

```bash
kubectl scale deployment/api-server --replicas=10 -n production
```

When to use: Traffic spike or resource exhaustion.

### Solution C: Long-term (< 2 hours)

Fix the root cause:

1. Identify the slow database query
2. Add a database index
3. Deploy the code optimization

When to use: After the immediate pressure is relieved.

## Validation

- Error rate < 1%
- Response time p95 < 200ms
- CPU usage < 70%
- No active alerts

## Prevention

How to prevent this issue in the future:

- Add a monitoring alert for connection pool saturation
- Implement auto-scaling based on request rate
- Set up load testing to find capacity limits
````
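
The Hypothesis 3 check above reads timing from a `curl-format.txt` file without showing it. A minimal version using standard curl `--write-out` variables (the filename just has to match whatever the guide references):

```bash
# Timing template for curl -w "@curl-format.txt"; all fields are built-in curl write-out variables.
cat > curl-format.txt <<'EOF'
    time_namelookup:  %{time_namelookup}s\n
       time_connect:  %{time_connect}s\n
    time_appconnect:  %{time_appconnect}s\n
   time_pretransfer:  %{time_pretransfer}s\n
 time_starttransfer:  %{time_starttransfer}s\n
         time_total:  %{time_total}s\n
EOF
curl -w "@curl-format.txt" -o /dev/null -s https://api.example.com/health
```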

## Decision Tree Format

````markdown
# Troubleshooting: Slow API Responses

## Start Here

                Check response time
                       |
        ┌──────────────┴──────────────┐
        │                             │
    < 500ms                       > 500ms
        │                             │
   NOT THIS RUNBOOK            Continue below

## Step 1: Locate the Slowness

```bash
# Check which service is slow
curl -w "@timing.txt" https://api.example.com/users
```

Decision:

- Time to first byte > 2s → Database slow (go to Step 2)
- Time to first byte < 100ms → Network slow (go to Step 3)
- Timeout → Service down (go to Step 4)

## Step 2: Database Diagnosis

```bash
# Check active queries
psql -c "SELECT query, state, query_start FROM pg_stat_activity WHERE state != 'idle'"
```

Decision:

- Query running > 5s → Slow query (Solution A)
- Many idle in transaction → Connection leak (Solution B)
- High connection count → Pool exhausted (Solution C)

### Solution A: Optimize Slow Query

1. Identify slow query from above
2. Run EXPLAIN ANALYZE
3. Add missing index or optimize query

### Solution B: Fix Connection Leak

1. Restart application pods
2. Review code for unclosed connections
3. Add connection timeout

### Solution C: Increase Connection Pool

1. Edit database config
2. Increase max_connections
3. Update application pool size

## Step 3: Network Diagnosis

... (continue with network troubleshooting)
````
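
For Solution C in the tree above, a sketch of checking connection headroom on PostgreSQL - the target value is only an example, and `ALTER SYSTEM` needs a server restart to take effect (on managed services such as RDS this is a parameter-group change instead):

```bash
# Current ceiling and how much of it is in use, grouped by session state.
psql -h "$DB_HOST" -c "SHOW max_connections;"
psql -h "$DB_HOST" -c "SELECT state, count(*) FROM pg_stat_activity GROUP BY state;"

# Raise the ceiling (example value; requires a restart on self-managed PostgreSQL).
psql -h "$DB_HOST" -c "ALTER SYSTEM SET max_connections = 200;"
```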


## Layered Troubleshooting

### Layer 1: Application

````markdown
## Application Layer Issues

### Check Application Health

1. **Health endpoint:**

   ```bash
   curl https://api.example.com/health
   ```

2. **Application logs:**

   ```bash
   kubectl logs deployment/api-server --tail=100 | grep ERROR
   ```

3. **Application metrics:**
   - Request rate
   - Error rate
   - Response time percentiles

### Common Application Issues

**Memory Leak**

- Symptom: Memory usage climbing over time
- Test: Check memory metrics in Datadog
- Fix: Restart pods, investigate with heap dump

**Thread Starvation**

- Symptom: Slow responses, high CPU
- Test: Thread dump analysis
- Fix: Increase thread pool size

**Code Bug**

- Symptom: Specific endpoints fail
- Test: Review recent deploys
- Fix: Rollback or hotfix
````
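
The memory-leak and thread-starvation tests above assume you can capture heap and thread dumps. A sketch for a JVM-based service - the runtime, the PID 1 assumption, and the pod/deployment names are all assumptions; other runtimes need their own tooling:

```bash
# Assumes the container runs a JVM as PID 1 and ships JDK tools (jstack/jmap).
kubectl exec deploy/api-server -n production -- jstack 1 > thread-dump.txt
kubectl exec deploy/api-server -n production -- jmap -dump:live,format=b,file=/tmp/heap.hprof 1
kubectl cp production/api-server-abc:/tmp/heap.hprof ./heap.hprof   # pod name is a placeholder
```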

### Layer 2: Infrastructure

````markdown
## Infrastructure Layer Issues

### Check Infrastructure Health

1. **Node resources:**

   ```bash
   kubectl top nodes
   ```

2. **Pod resources:**

   ```bash
   kubectl top pods -n production
   ```

3. **Network connectivity:**

   ```bash
   kubectl run -it --rm debug --image=nicolaka/netshoot --restart=Never -- ping database.internal
   ```

### Common Infrastructure Issues

**Node Under Pressure**

- Symptom: Pods evicted, slow scheduling
- Test: `kubectl describe node` for pressure conditions
- Fix: Scale node pool or add nodes

**Network Partition**

- Symptom: Intermittent timeouts
- Test: MTR between pods and destination
- Fix: Check security groups, routing tables

**Disk I/O Saturation**

- Symptom: Slow database, high latency
- Test: Check IOPS metrics in CloudWatch
- Fix: Increase provisioned IOPS
````
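
Two of the tests above as concrete commands - the node name and internal hostname are placeholders, and `mtr` ships in the same netshoot image used for the ping check:

```bash
# Node pressure conditions (MemoryPressure, DiskPressure, PIDPressure).
kubectl describe node <node-name> | grep -A8 "Conditions:"

# Path quality to an internal dependency, measured from inside the cluster.
kubectl run -it --rm debug --image=nicolaka/netshoot --restart=Never -- \
  mtr --report --report-cycles 20 database.internal
```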

### Layer 3: External Dependencies

````markdown
## External Dependency Issues

### Check External Services

1. **Third-party APIs:**

   ```bash
   curl -w "@timing.txt" https://api.stripe.com/health
   ```

2. **Status pages:**
   - Check status.stripe.com
   - Check status.aws.amazon.com

3. **DNS resolution:**

   ```bash
   nslookup api.stripe.com
   dig api.stripe.com
   ```

### Common External Issues

**API Rate Limiting**

- Symptom: 429 responses from external service
- Test: Check rate limit headers
- Fix: Implement backoff, cache responses

**Service Degradation**

- Symptom: Slow external API responses
- Test: Check their status page
- Fix: Implement circuit breaker, use fallback

**DNS Failure**

- Symptom: Cannot resolve hostname
- Test: DNS queries
- Fix: Check DNS config, try an alternative resolver
````
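
For the rate-limiting fix above, a minimal retry-with-exponential-backoff sketch; the endpoint and retry budget are illustrative, and a production client should also honor a `Retry-After` header when the provider sends one:

```bash
# Retry on HTTP 429 with exponential backoff (1s, 2s, 4s, 8s, 16s).
url="https://api.stripe.com/v1/charges"
delay=1
for attempt in 1 2 3 4 5; do
  status=$(curl -s -o /dev/null -w "%{http_code}" "$url")
  if [ "$status" != "429" ]; then
    echo "attempt $attempt: HTTP $status"
    break
  fi
  echo "attempt $attempt: rate limited, retrying in ${delay}s"
  sleep "$delay"
  delay=$((delay * 2))
done
```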

## Systematic Debugging

### Use the Scientific Method

````markdown
# Debugging: Database Connection Failures

## 1. Observation

**What we know:**
- Error: "connection refused" in logs
- Started: 2025-01-15 14:30 UTC
- Frequency: Every database query fails
- Scope: All pods affected

## 2. Hypothesis

**Possible causes:**
1. Database instance is down
2. Security group blocking traffic
3. Network partition
4. Wrong credentials

## 3. Test Each Hypothesis

### Test 1: Database instance status

```bash
aws rds describe-db-instances --db-instance-identifier prod-db | jq '.DBInstances[0].DBInstanceStatus'
```

Result: "available"
Conclusion: Database is running
✗ Hypothesis 1 rejected

### Test 2: Security group rules

```bash
aws ec2 describe-security-groups --group-ids sg-abc123 | jq '.SecurityGroups[0].IpPermissions'
```

Result: Port 5432 open only to 10.0.0.0/16
Pod IP: 10.1.0.5
Conclusion: Pod IP not in allowed range
✓ ROOT CAUSE FOUND

## 4. Fix

Update security group:

```bash
aws ec2 authorize-security-group-ingress \
  --group-id sg-abc123 \
  --protocol tcp \
  --port 5432 \
  --cidr 10.1.0.0/16
```

## 5. Verify

Test connection from pod:

```bash
kubectl exec -it api-server-abc -- psql -h prod-db.rds.amazonaws.com -c "SELECT 1"
```

Result: Success ✓
````


## Time-Boxed Investigation

```markdown
# Troubleshooting: Production Outage

**Time Box:** Spend MAX 15 minutes investigating before escalating.

## First 5 Minutes: Quick Wins

- [ ] Check pod status
- [ ] Check recent deploys
- [ ] Check external status pages
- [ ] Review monitoring dashboards

**If issue persists:** Continue to next phase.

## Minutes 5-10: Common Causes

- [ ] Restart pods (quick mitigation)
- [ ] Check database connectivity
- [ ] Review application logs
- [ ] Check resource limits

**If issue persists:** Continue to next phase.

## Minutes 10-15: Deep Dive

- [ ] Enable debug logging
- [ ] Capture thread dump
- [ ] Check for memory leaks
- [ ] Review network traces

**If issue persists:** ESCALATE to senior engineer.

## Escalation

**Escalate to:** Platform Team Lead
**Provide:**
- Timeline of issue
- Tests performed
- Current error rate
- Mitigation attempts
```
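
The "First 5 Minutes" checklist above collapses into a few commands; the deployment and namespace names are assumptions, and dashboards and status pages still need a manual look:

```bash
# Anything not Running or Completed?
kubectl get pods -n production | grep -vE 'Running|Completed'

# Did we deploy recently?
kubectl rollout history deployment/api-server -n production | tail -5

# Latest cluster events, newest last.
kubectl get events -n production --sort-by=.lastTimestamp | tail -20
```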

## Common Troubleshooting Patterns

### Binary Search

```markdown
## Finding Which Service is Slow

Using binary search to narrow down the problem:

1. **Check full request:** 5000ms total
2. **Check first half (API → Database):** 4900ms
   → Problem is in database query
3. **Check database:** Query takes 4800ms
4. **Check query plan:** Sequential scan on large table
5. **Root cause:** Missing index

**Fix:** Add index on frequently queried column.
```
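
A sketch of confirming and fixing the missing index in PostgreSQL - the table and column names are invented for illustration:

```bash
# Confirm the sequential scan, add the index without blocking writes, then re-check the plan.
psql -h "$DB_HOST" -c "EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;"
psql -h "$DB_HOST" -c "CREATE INDEX CONCURRENTLY idx_orders_customer_id ON orders (customer_id);"
psql -h "$DB_HOST" -c "EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;"
```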

### Correlation Analysis

```markdown
## Finding Related Events

Look for patterns and correlations:

**Timeline:**
- 14:25 - Deploy completed
- 14:30 - Error rate spike
- 14:35 - Database CPU at 100%
- 14:40 - Requests timing out

**Correlation:** Deploy introduced an N+1 query.

**Evidence:**
- No config changes
- No infrastructure changes
- Only a code deploy
- Errors coincide with the deploy

**Action:** Roll back the deploy.
```
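
The rollback itself, assuming the Kubernetes deployment used throughout this guide:

```bash
# Roll back to the previous revision and wait for it to settle.
kubectl rollout undo deployment/api-server -n production
kubectl rollout status deployment/api-server -n production
```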

## Anti-Patterns

### Don't Skip Obvious Checks

```markdown
# Bad: Jump to complex solutions
## Database Slow

Must be a query optimization issue. Let's analyze query plans...

# Good: Check basics first
## Database Slow

1. Is the database actually running?
2. Can we connect to it?
3. Are there any locks?
4. What does the slow query log show?
```
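
The first three basic checks above as commands - the host is a placeholder, and the lock query is one common way to surface blocked sessions in PostgreSQL:

```bash
# 1. Is the database accepting connections?
pg_isready -h "$DB_HOST" -p 5432

# 2. Can we actually run a query?
psql -h "$DB_HOST" -c "SELECT 1;"

# 3. Are any sessions currently waiting on locks?
psql -h "$DB_HOST" -c "SELECT pid, state, wait_event_type, query FROM pg_stat_activity WHERE wait_event_type = 'Lock';"
```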

### Don't Guess Randomly

```markdown
# Bad: Random changes
## API Errors

Let's try:
- Restarting the database
- Scaling to 100 pods
- Changing the load balancer config
- Updating the kernel

# Good: Systematic approach
## API Errors

1. What is the actual error message?
2. When did it start?
3. What changed before it started?
4. Can we reproduce it?
```
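
A sketch for the first three questions, pulling the exact error and recent changes from the cluster (names assumed as elsewhere in this guide):

```bash
# The actual error message and the first time it appears in the recent window.
kubectl logs deployment/api-server -n production --since=6h --timestamps | grep -m1 -i error

# What changed before it started: recent rollouts and their change causes.
kubectl rollout history deployment/api-server -n production
```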

### Don't Skip Documentation

```markdown
# Bad: No notes
## Fixed It

I restarted some pods and now it works.

# Good: Document findings
## Resolution

**Root Cause:** Memory leak in worker process
**Evidence:** Pod memory climbing linearly over 6 hours
**Temporary Fix:** Restarted pods
**Long-term Fix:** PR #1234 fixes memory leak
**Prevention:** Added memory usage alerts
```

## Related Skills

- runbook-structure: Organizing operational documentation
- incident-response: Handling production incidents