| name | runbook-generator |
| description | Runbook expert. Use for creating operational procedures, incident response, and maintenance documentation. |
# Runbook Generator
Expert in creating comprehensive, standardized runbooks for operational procedures, incident response, and system maintenance tasks.
## Runbook Structure
```yaml
runbook_template:
  metadata:
    title: "Runbook title"
    version: "1.0"
    last_updated: "2024-01-15"
    owner: "Team/Person"
    reviewers: ["Name 1", "Name 2"]

  overview:
    purpose: "What this runbook accomplishes"
    scope: "Systems/services affected"
    audience: "Who should use this"

  prerequisites:
    access:
      - "AWS Console access"
      - "SSH key for production servers"
      - "Database credentials"
    tools:
      - "kubectl configured"
      - "AWS CLI installed"
      - "jq for JSON parsing"
    knowledge:
      - "Basic Kubernetes concepts"
      - "Understanding of service architecture"

  execution:
    estimated_time: "15-30 minutes"
    risk_level: "Medium"
    requires_change_ticket: true
    requires_approval: true
    can_be_automated: true

  steps: []            # Detailed steps below
  verification: []     # How to confirm success
  rollback: []         # How to undo changes
  troubleshooting: []  # Common issues

  contacts:
    primary_oncall: "PagerDuty"
    escalation: "Engineering Manager"
    subject_experts: ["DBA Team", "Platform Team"]
```
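The `prerequisites.tools` list can be backed by a preflight check that runs before the first step. The following is a minimal sketch, assuming the tools are invoked as `kubectl`, `aws`, and `jq`; `require_tools` is a hypothetical helper name, and it only confirms that each binary is on `PATH`, not that it is configured.

```bash
# Hypothetical helper: report every required tool that is not on PATH.
# Returns 0 when all tools are found, 1 otherwise.
require_tools() {
  status=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing tool: $tool" >&2
      status=1
    fi
  done
  return "$status"
}

# Example call at the top of a runbook script (tool names are assumptions):
#   require_tools kubectl aws jq || exit 1
```

Reporting all missing tools in one pass, rather than stopping at the first, lets the operator fix the workstation once before starting.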
## Standard Runbook Template
# [Runbook Title]
**Version:** 1.0
**Last Updated:** YYYY-MM-DD
**Owner:** Team Name
**Risk Level:** Low | Medium | High | Critical
## Overview
### Purpose
Brief description of what this runbook accomplishes.
### When to Use
- Trigger condition 1
- Trigger condition 2
- Alert: "Alert Name" fires
### Scope
Systems and services affected:
- Service A
- Database B
- External dependency C
## Prerequisites
### Required Access
- [ ] Production AWS Console
- [ ] Kubernetes cluster access
- [ ] Database read/write permissions
### Required Tools
```bash
# Verify kubectl
kubectl version --client

# Verify AWS CLI
aws sts get-caller-identity

# Verify database connectivity
psql -h "$DB_HOST" -U "$DB_USER" -c "SELECT 1"
```

### Required Knowledge
- Kubernetes pod management
- Service architecture overview
- Incident response process

## Pre-Execution Checklist
- [ ] Change ticket created: CHG-XXXXX
- [ ] Approval obtained from: [Name]
- [ ] Backup verified (if applicable)
- [ ] Stakeholders notified
- [ ] Maintenance window scheduled (if applicable)
## Execution Steps
### Step 1: [Action Name]
**Purpose:** Why this step is necessary

**Command:**
```bash
kubectl get pods -n production -l app=myservice
```

**Expected Output:**
```
NAME                   READY   STATUS    RESTARTS   AGE
myservice-abc123-xyz   1/1     Running   0          2d
myservice-def456-uvw   1/1     Running   0          2d
```

**Verification:** Confirm all pods show STATUS=Running

**If unexpected:** See Troubleshooting section

### Step 2: [Next Action]
**Purpose:** Description

**Command:**
```bash
# Command with explanation
kubectl scale deployment myservice --replicas=3 -n production
```

**Expected Output:**
```
deployment.apps/myservice scaled
```

**Verification:**
```bash
# Verify new replicas are running
kubectl get pods -n production -l app=myservice -w
```

**Wait for:** All 3 pods to show Running status (typically 2-5 minutes)
## Post-Execution Verification
### Verify Service Health
```bash
# Check deployment status
kubectl rollout status deployment/myservice -n production

# Check service endpoints
kubectl get endpoints myservice -n production

# Verify application health
curl -s https://api.example.com/health | jq .
```

**Expected:**
```json
{
  "status": "healthy",
  "version": "1.2.3",
  "uptime": "2h30m"
}
```

### Verify Metrics
- Error rate returned to normal (<0.1%)
- Latency within SLA (<200ms p99)
- No new alerts firing

## Rollback Procedure
### When to Rollback
- Error rate exceeds 1%
- Latency exceeds 500ms p99
- Critical functionality broken

### Rollback Steps
```bash
# Rollback to previous deployment
kubectl rollout undo deployment/myservice -n production

# Verify rollback
kubectl rollout status deployment/myservice -n production

# Confirm previous version
kubectl get deployment myservice -n production -o jsonpath='{.spec.template.spec.containers[0].image}'
```
## Troubleshooting
| Symptom | Likely Cause | Resolution |
|---|---|---|
| Pods stuck in Pending | Resource constraints | Check node capacity: `kubectl describe nodes` |
| CrashLoopBackOff | Application error | Check logs: `kubectl logs -f <pod>` |
| ImagePullBackOff | Registry auth issue | Verify secret: `kubectl get secret regcred` |
| Connection refused | Service not ready | Wait for readiness probe, check endpoints |

### Common Issues
**Issue: Deployment times out**
```bash
# Check pod events
kubectl describe pod <pod-name> -n production

# Check resource limits
kubectl top pods -n production
```

**Issue: Database connection failures**
```bash
# Verify database connectivity
kubectl exec -it <pod> -n production -- psql -h "$DB_HOST" -c "SELECT 1"

# Check connection pool
kubectl logs <pod> -n production | grep -i "connection"
```

## Emergency Contacts
| Role | Contact | When to Engage |
|---|---|---|
| On-call Engineer | PagerDuty | Any issue |
| Database Team | #dba-oncall | Database issues |
| Platform Team | #platform-oncall | Infrastructure issues |
| Engineering Manager | [Name] | Escalation |

## Change Log
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2024-01-15 | Author | Initial version |

## Related Documentation
## Runbook Types
### Incident Response Runbook
```yaml
incident_runbook:
  sections:
    detection:
      alert_name: "High Error Rate - Payment Service"
      threshold: "Error rate > 5% for 5 minutes"
      severity: "P1"

    immediate_actions:
      - step: "Acknowledge alert"
        command: "In PagerDuty, acknowledge incident"
        time: "< 5 min"

      - step: "Assess impact"
        command: |
          # Check error rate (-G with --data-urlencode stops curl from
          # treating "[5m]" as a URL glob range)
          curl -sG "https://metrics.example.com/api/v1/query" \
            --data-urlencode 'query=rate(http_errors[5m])'
        time: "< 2 min"

      - step: "Notify stakeholders"
        action: "Post in #incident-channel"
        template: |
          🚨 INCIDENT: Payment Service High Errors
          Severity: P1
          Status: Investigating
          Impact: Payment processing affected
          IC: @oncall

    investigation:
      - "Check recent deployments"
      - "Review error logs"
      - "Check dependent services"
      - "Review infrastructure metrics"

    mitigation:
      options:
        - name: "Rollback deployment"
          when: "Error started after deploy"
          command: "kubectl rollout undo deployment/payment -n prod"

        - name: "Scale up"
          when: "Load-related errors"
          command: "kubectl scale deployment/payment --replicas=10 -n prod"

        - name: "Enable circuit breaker"
          when: "Downstream dependency failing"
          command: "Toggle feature flag: payment.circuit_breaker=true"

    resolution:
      checklist:
        - "[ ] Error rate < 0.1%"
        - "[ ] No P1 alerts"
        - "[ ] Stakeholders notified"
        - "[ ] Incident documented"
```
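The stakeholder notification in the incident runbook follows a fixed five-line layout, so it can be filled in from arguments instead of being typed by hand during an incident. This is a sketch only: `render_incident_message` is a hypothetical helper, and posting the result to a channel is left out because the chat tooling is not specified here.

```bash
# Hypothetical helper: fill in the incident notification layout.
# Arguments: title, severity, status, impact, incident commander handle.
render_incident_message() {
  printf '🚨 INCIDENT: %s\nSeverity: %s\nStatus: %s\nImpact: %s\nIC: %s\n' \
    "$1" "$2" "$3" "$4" "$5"
}

# Example call using the values from the runbook above:
#   render_incident_message "Payment Service High Errors" P1 Investigating \
#     "Payment processing affected" "@oncall"
```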
### Deployment Runbook
```yaml
deployment_runbook:
  pre_deployment:
    checklist:
      - "[ ] Code review approved"
      - "[ ] CI/CD pipeline passed"
      - "[ ] Staging tested"
      - "[ ] Change ticket approved"
      - "[ ] Rollback plan documented"

    verification:
      - step: "Verify staging health"
        command: |
          curl -s https://staging.example.com/health

      - step: "Check deployment queue"
        command: |
          kubectl get pods -n staging -l app=myservice

  deployment:
    - step: "Apply deployment"
      command: |
        kubectl apply -f k8s/production/deployment.yaml

    - step: "Monitor rollout"
      command: |
        kubectl rollout status deployment/myservice -n production --timeout=10m

    - step: "Verify new version"
      command: |
        kubectl get deployment myservice -n production \
          -o jsonpath='{.spec.template.spec.containers[0].image}'

  post_deployment:
    - step: "Smoke test"
      command: |
        ./scripts/smoke-test.sh production

    - step: "Monitor metrics"
      duration: "15 minutes"
      watch:
        - "Error rate"
        - "Latency p99"
        - "Request rate"

    - step: "Update ticket"
      action: "Mark CHG ticket as completed"
```
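The "Monitor metrics" step lists what to watch but not when to act. One option is to reuse the "When to Rollback" thresholds from the Standard Runbook Template (error rate above 1%, p99 latency above 500 ms). The sketch below assumes the two values have already been read from the monitoring system; `should_rollback` is a hypothetical helper, and `awk` is used only because plain shell arithmetic cannot compare decimals.

```bash
# Hypothetical helper: decide whether the rollback thresholds are exceeded.
# Arguments: error rate in percent, p99 latency in milliseconds.
# Returns 0 (roll back) if either threshold is exceeded, 1 otherwise.
should_rollback() {
  awk -v err="$1" -v lat="$2" 'BEGIN { exit !(err + 0 > 1 || lat + 0 > 500) }'
}

# Example call (the metric values are placeholders):
#   if should_rollback "$ERROR_RATE_PCT" "$LATENCY_P99_MS"; then
#     kubectl rollout undo deployment/myservice -n production
#   fi
```

Values exactly at a threshold (1% or 500 ms) do not trigger a rollback, matching the "exceeds" wording of the criteria.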
### Maintenance Runbook
```yaml
maintenance_runbook:
  log_rotation:
    schedule: "Weekly, Sunday 02:00 UTC"
    steps:
      - step: "Connect to server"
        command: |
          ssh admin@logs.example.com

      - step: "Rotate logs"
        command: |
          sudo logrotate -f /etc/logrotate.d/application

      - step: "Verify rotation"
        command: |
          ls -la /var/log/application/
          # Should see rotated files with date suffix

      - step: "Clean old logs"
        command: |
          # Remove logs older than 30 days
          find /var/log/application/ -name "*.log.*" -mtime +30 -delete

      - step: "Verify disk space"
        command: |
          df -h /var/log
          # Should show > 20% free

  database_maintenance:
    schedule: "Monthly, first Sunday 03:00 UTC"
    steps:
      - step: "Check table sizes"
        command: |
          # pg_total_relation_size expects a regclass, so build a
          # schema-qualified name and cast it explicitly
          psql -c "
            SELECT tablename,
                   pg_size_pretty(pg_total_relation_size(format('%I.%I', schemaname, tablename)::regclass))
            FROM pg_tables
            WHERE schemaname = 'public'
            ORDER BY pg_total_relation_size(format('%I.%I', schemaname, tablename)::regclass) DESC
            LIMIT 10;
          "

      - step: "Run VACUUM ANALYZE"
        command: |
          psql -c "VACUUM ANALYZE;"

      - step: "Reindex if needed"
        command: |
          psql -c "REINDEX DATABASE mydb;"
```
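The "Clean old logs" step deletes files in a single command, which leaves no chance to review what will be removed. A cautious variant is to list the matches first with the same `find` filters and delete only after checking the list. A minimal sketch, where `list_old_logs` is a hypothetical helper name:

```bash
# Hypothetical helper: list rotated logs older than a given number of days
# without deleting anything.
# Arguments: log directory, age in days.
list_old_logs() {
  find "$1" -name '*.log.*' -mtime +"$2" -print
}

# Example call, mirroring the path and age used in the runbook above:
#   list_old_logs /var/log/application 30
```

Once the listing looks right, the runbook's original `find ... -delete` command can be run with the same directory and age.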
## Writing Guidelines
```yaml
principles:
  clarity:
    - "Use active voice"
    - "Be explicit, never assume"
    - "One action per step"

  completeness:
    - "Include all commands"
    - "Show expected output"
    - "Document verification"

  safety:
    - "Test in non-prod first"
    - "Include rollback steps"
    - "Document risks"

formatting:
  commands:
    - "Use code blocks with language"
    - "Include full paths"
    - "Add comments for complex commands"

  steps:
    - "Number sequentially"
    - "Include purpose"
    - "Show expected result"
    - "Note time estimate"

  variables:
    format: "$VARIABLE_NAME or <placeholder>"
    document: "List all variables at start"
```
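Listing all variables at the start also makes it possible to check them before any command runs. The sketch below assumes the runbook's variables are shell variables such as `DB_HOST` and `DB_USER`; `require_vars` is a hypothetical helper, and because it uses `eval` to look up each name it should only be given names written by the runbook author, never user input.

```bash
# Hypothetical helper: report every listed variable that is unset or empty.
# Returns 0 when all are set, 1 otherwise.
require_vars() {
  status=0
  for name in "$@"; do
    # Indirect lookup of the variable called $name (POSIX-compatible).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "unset variable: $name" >&2
      status=1
    fi
  done
  return "$status"
}

# Example call at the top of a runbook script:
#   require_vars DB_HOST DB_USER || exit 1
```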
## Quality Checklist
```yaml
validation:
  structure:
    - "[ ] Clear title and metadata"
    - "[ ] Prerequisites listed"
    - "[ ] Steps numbered and clear"
    - "[ ] Expected outputs included"
    - "[ ] Verification steps present"
    - "[ ] Rollback documented"
    - "[ ] Troubleshooting section"
    - "[ ] Contacts listed"

  testing:
    - "[ ] All commands tested"
    - "[ ] Outputs verified"
    - "[ ] Rollback tested"
    - "[ ] Time estimates accurate"

  maintenance:
    - "[ ] Version number updated"
    - "[ ] Change log maintained"
    - "[ ] Quarterly review scheduled"
    - "[ ] Owner assigned"
```
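Several of the structure items map onto the top-level keys of `runbook_template` from the Runbook Structure section, so part of the checklist can be checked mechanically. The sketch below assumes the runbook is stored as a YAML file using those key names; `check_runbook_sections` is a hypothetical helper, and it is a plain text match rather than a YAML parse, so it cannot tell whether a section is empty or nested in the wrong place.

```bash
# Hypothetical helper: report which expected sections are absent from a
# runbook YAML file. Returns 0 when all are present, 1 otherwise.
check_runbook_sections() {
  file=$1
  missing=0
  for section in metadata overview prerequisites execution steps \
                 verification rollback troubleshooting contacts; do
    # Text match on "<key>:" at any indentation; not a real YAML parse.
    if ! grep -Eq "^[[:space:]]*${section}:" "$file"; then
      echo "missing section: ${section}"
      missing=1
    fi
  done
  return "$missing"
}

# Example call (the file name is a placeholder):
#   check_runbook_sections runbook.yaml
```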
## Best Practices
- Test everything: every command must be verified
- Show expected output: the user should know what they will see
- Include rollback: always have a rollback plan
- Keep updated: review every quarter
- Version control: keep runbooks in git
- Automate when possible: automate recurring procedures