| name | ops-disaster-recovery |
| description | Structured workflow for disaster recovery planning, implementation, and testing including RTO/RPO definition, DR strategy selection, and failover procedures. |
| trigger | DR strategy development; DR plan review/update; DR testing/drills; post-incident DR improvement |
| skip_when | Day-to-day backup operations -> standard procedures; application-level redundancy -> use ring-dev-team specialists; single-instance failure recovery -> standard runbooks |
# Disaster Recovery Workflow
This skill defines the structured process for disaster recovery planning and testing. Use it for comprehensive DR strategy development and validation.
## DR Planning Phases
| Phase | Focus | Output |
|---|---|---|
| 1. Business Impact | Define criticality and requirements | BIA document |
| 2. Strategy Selection | Choose appropriate DR strategy | DR strategy |
| 3. Architecture Design | Design DR infrastructure | DR architecture |
| 4. Runbook Development | Document failover procedures | DR runbooks |
| 5. Testing | Validate DR capabilities | Test report |
| 6. Maintenance | Keep DR current | Update schedule |
## Phase 1: Business Impact Analysis
### Service Classification
Classify services by business criticality:
| Tier | Definition | RTO | RPO | Example Services |
|---|---|---|---|---|
| Tier 1 | Critical - business cannot operate | <15 min | <1 min | Payment processing |
| Tier 2 | Important - significant impact | <1 hour | <15 min | Customer portal |
| Tier 3 | Standard - moderate impact | <4 hours | <1 hour | Internal tools |
| Tier 4 | Low - minimal impact | <24 hours | <24 hours | Dev environments |
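
The tier table can be read as a lookup: given how much downtime and data loss a service can tolerate, pick the least strict tier whose targets still fit. The sketch below is illustrative only; the names `TIER_TARGETS` and `classify_tier` are assumptions introduced here, not part of this skill.

```python
from datetime import timedelta

# Tier targets copied from the table above: (tier, RTO target, RPO target).
TIER_TARGETS = [
    (1, timedelta(minutes=15), timedelta(minutes=1)),
    (2, timedelta(hours=1), timedelta(minutes=15)),
    (3, timedelta(hours=4), timedelta(hours=1)),
    (4, timedelta(hours=24), timedelta(hours=24)),
]


def classify_tier(tolerable_downtime: timedelta, tolerable_data_loss: timedelta) -> int:
    """Return the least strict tier whose RTO and RPO targets fit the tolerances."""
    # Walk from Tier 4 (cheapest) towards Tier 1 and stop at the first fit.
    for tier, rto, rpo in reversed(TIER_TARGETS):
        if rto <= tolerable_downtime and rpo <= tolerable_data_loss:
            return tier
    # Nothing fits: the service needs Tier 1 treatment (or something stricter).
    return 1
```

For example, a service that tolerates 2 hours of downtime and 30 minutes of data loss is classified as Tier 2, because the Tier 3 RTO of 4 hours would exceed its tolerance.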
### BIA Template
## Business Impact Analysis
**Assessment Date:** YYYY-MM-DD
**Assessed By:** [name]
### Service Classification
| Service | Business Function | Revenue Impact | Tier | RTO | RPO |
|---------|------------------|----------------|------|-----|-----|
| payment-api | Process transactions | $X,XXX/hour | 1 | 15 min | 1 min |
| customer-portal | Customer access | $XXX/hour | 2 | 1 hour | 15 min |
| admin-tools | Internal operations | $0/hour | 3 | 4 hours | 1 hour |
### Data Classification
| Data Type | Classification | Backup Frequency | Retention |
|-----------|---------------|------------------|-----------|
| Transaction data | Critical | Continuous | 7 years |
| Customer data | Important | Hourly | 3 years |
| Application logs | Standard | Daily | 90 days |
### Dependencies
| Service | Dependencies | DR Impact |
|---------|--------------|-----------|
| payment-api | Database, payment-gateway | All must fail over together |
| customer-portal | Database, auth-service | Sequential failover possible |
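
The Dependencies table implies an ordering constraint: a service should only be failed over once everything it depends on is available in the DR region. One way to derive such an order is a topological sort. This is a minimal sketch using the sample rows above; `DEPENDENCIES` and `failover_order` are hypothetical names, and it assumes Python 3.9+ for `graphlib`.

```python
from graphlib import TopologicalSorter

# Service -> set of dependencies, taken from the sample Dependencies table.
DEPENDENCIES = {
    "payment-api": {"Database", "payment-gateway"},
    "customer-portal": {"Database", "auth-service"},
}


def failover_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Order components so every dependency comes before the services that use it."""
    # TopologicalSorter treats each mapping value as the node's predecessors,
    # so static_order() yields dependencies first.
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for step, component in enumerate(failover_order(DEPENDENCIES), start=1):
        print(f"{step}. fail over {component}")
```

The exact order among independent components is not fixed; the only guarantee is that `Database` precedes both services, `payment-gateway` precedes `payment-api`, and `auth-service` precedes `customer-portal`.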
## Phase 2: Strategy Selection
### DR Strategy Comparison
| Strategy | RTO | RPO | Cost | Complexity | Best For |
|---|---|---|---|---|---|
| Backup & Restore | Hours | Hours | $ | Low | Tier 4 services |
| Pilot Light | 30-60 min | Minutes | $$ | Medium | Tier 3 services |
| Warm Standby | 10-30 min | Seconds-Minutes | $$$ | Medium-High | Tier 2 services |
| Hot Standby | <10 min | Seconds | $$$$ | High | Tier 1 services |
| Multi-Active | Near-zero | Near-zero | $$$$$ | Very High | Ultra-critical |
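
The comparison table can be turned into a simple selection rule: among the strategies whose worst-case RTO and RPO stay under the targets, take the cheapest. The sketch below is hedged accordingly. The table gives qualitative ranges ("Hours", "Seconds"), so every numeric bound in `STRATEGIES` is an assumption chosen for illustration, and `cheapest_strategy` is a hypothetical helper.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    worst_rto: timedelta  # assumed upper bound for the table's RTO range
    worst_rpo: timedelta  # assumed upper bound for the table's RPO range
    cost_rank: int        # number of "$" signs in the table


# Numeric bounds are illustrative assumptions, not values from the table.
STRATEGIES = [
    Strategy("Backup & Restore", timedelta(hours=12), timedelta(hours=12), 1),
    Strategy("Pilot Light", timedelta(minutes=60), timedelta(minutes=15), 2),
    Strategy("Warm Standby", timedelta(minutes=30), timedelta(minutes=5), 3),
    Strategy("Hot Standby", timedelta(minutes=10), timedelta(seconds=30), 4),
    Strategy("Multi-Active", timedelta(minutes=1), timedelta(seconds=1), 5),
]


def cheapest_strategy(target_rto: timedelta, target_rpo: timedelta) -> Optional[Strategy]:
    """Return the lowest-cost strategy that stays strictly under both targets."""
    # Targets are written as "<X" in the tier table, hence the strict comparison.
    candidates = [
        s for s in STRATEGIES
        if s.worst_rto < target_rto and s.worst_rpo < target_rpo
    ]
    return min(candidates, key=lambda s: s.cost_rank, default=None)
```

With these assumed bounds, the Tier 1 targets (15 min RTO, 1 min RPO) select Hot Standby and the Tier 2 targets (1 hour, 15 min) select Warm Standby, which matches the "Best For" column. A `None` result means no listed strategy meets the targets.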
### Strategy Selection Matrix
## DR Strategy Selection
### Requirements Summary
| Requirement | Value |
|-------------|-------|
| Target RTO | [X minutes/hours] |
| Target RPO | [X minutes/hours] |
| Budget | $[X,XXX]/month for DR |
| Compliance | [frameworks] |
### Strategy Decision
**Selected Strategy:** [Pilot Light / Warm Standby / Hot Standby]
**Rationale:**
1. RTO requirement of [X] achieved by [strategy]
2. RPO requirement of [X] achieved with [replication method]
3. Budget of $[X]/month supports [strategy] (~XX% of production cost)
4. Compliance requirement for [X] met with [features]
### Trade-offs Accepted
| Trade-off | Impact | Mitigation |
|-----------|--------|------------|
| Higher DR cost | +$X/month | Justified by RTO requirement |
| Manual failover steps | 5-10 min added | Automation planned Q2 |
## Phase 3: Architecture Design
### DR Architecture Components
| Component | Primary | DR | Replication |
|---|---|---|---|
| DNS | Route53 | Route53 | Global service |
| Load Balancer | ALB (us-east-1) | ALB (us-west-2) | Configuration sync |
| Compute | EKS (us-east-1) | EKS (us-west-2) | GitOps deployment |
| Database | Aurora (us-east-1) | Aurora Global (us-west-2) | Async replication |
| Storage | S3 (us-east-1) | S3 (us-west-2) | Cross-region replication |
| Secrets | Secrets Manager | Secrets Manager | Manual sync |
### Architecture Diagram Template
Primary Region (us-east-1)             DR Region (us-west-2)
┌─────────────────────────┐         ┌─────────────────────────┐
│                         │         │                         │
│   ┌─────────────────┐   │         │   ┌─────────────────┐   │
│   │       ALB       │   │         │   │       ALB       │   │
│   └────────┬────────┘   │         │   └────────┬────────┘   │
│            │            │         │            │ (standby)  │
│   ┌────────┴────────┐   │         │   ┌────────┴────────┐   │
│   │   EKS Cluster   │   │         │   │   EKS Cluster   │   │
│   │    (Active)     │   │         │   │    (Standby)    │   │
│   └────────┬────────┘   │         │   └────────┬────────┘   │
│            │            │         │            │            │
│   ┌────────┴────────┐   │  async  │   ┌────────┴────────┐   │
│   │     Aurora      │───┼────────►│   │     Aurora      │   │
│   │    (Primary)    │   │         │   │    (Replica)    │   │
│   └─────────────────┘   │         │   └─────────────────┘   │
│                         │         │                         │
└─────────────────────────┘         └─────────────────────────┘
             │                                   │
             └─────────────────┬─────────────────┘
                               │
                        ┌──────┴──────┐
                        │   Route53   │
                        │  (Global)   │
                        └─────────────┘
## Phase 4: Runbook Development
### Failover Runbook Structure
## Failover Runbook: [Service Name]
**Version:** 1.0
**Last Updated:** YYYY-MM-DD
**Owner:** [team]
### Pre-Conditions
- [ ] DR region healthy (check dashboard)
- [ ] Replication lag <[X seconds/minutes]
- [ ] On-call personnel available
- [ ] Communication channels ready
### Failover Decision Criteria
| Criteria | Automatic | Manual |
|----------|-----------|--------|
| Primary region unavailable >5 min | Yes | - |
| Replication lag >15 min | - | Yes |
| Data corruption detected | - | Yes |
| Planned maintenance | - | Yes |
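
The decision criteria above can be encoded so that monitoring and humans apply the same rules. The following is a sketch, not a definitive implementation: `RegionStatus`, `FailoverDecision`, and `evaluate` are hypothetical names, the thresholds are the sample values from the table, and giving the manual criteria precedence over the automatic one is an assumption the table does not state.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum


class FailoverDecision(Enum):
    NONE = "no failover"
    AUTOMATIC = "automatic failover"
    MANUAL = "manual decision required"


@dataclass
class RegionStatus:
    primary_unavailable_for: timedelta = timedelta(0)
    replication_lag: timedelta = timedelta(0)
    data_corruption_detected: bool = False
    planned_maintenance: bool = False


def evaluate(status: RegionStatus) -> FailoverDecision:
    """Map observed conditions to the failover decision from the criteria table."""
    # Assumption: manual criteria win, since high lag or corruption makes an
    # unattended failover risky (possible data loss or replicated corruption).
    if (
        status.replication_lag > timedelta(minutes=15)
        or status.data_corruption_detected
        or status.planned_maintenance
    ):
        return FailoverDecision.MANUAL
    if status.primary_unavailable_for > timedelta(minutes=5):
        return FailoverDecision.AUTOMATIC
    return FailoverDecision.NONE
```

A primary outage of 6 minutes with healthy replication yields `AUTOMATIC`; the same outage combined with 20 minutes of replication lag yields `MANUAL`.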
### Failover Steps
1. **Verify DR Readiness** (2 min)
   ```bash
   # Check DR database status
   aws rds describe-db-clusters --region us-west-2
   # Check EKS cluster status
   kubectl --context=dr get nodes
   ```
2. **Stop Writes to Primary** (1 min)
   ```bash
   # Scale down primary services
   kubectl --context=primary scale deployment/api --replicas=0
   ```
3. **Promote DR Database** (5 min)
   ```bash
   # Promote Aurora replica. AWS expects the target cluster's ARN as the identifier.
   # Without --allow-data-loss this runs as a switchover, which needs a healthy primary.
   aws rds failover-global-cluster \
     --global-cluster-identifier my-global-cluster \
     --target-db-cluster-identifier dr-cluster
   ```
4. **Activate DR Services** (2 min)
   ```bash
   # Scale up DR services
   kubectl --context=dr scale deployment/api --replicas=10
   ```
5. **Update DNS** (1-5 min propagation)
   ```bash
   # Update Route53 health check.
   # Caution: Route 53 treats a disabled health check as always healthy; to steer
   # traffic away from the primary, --inverted is the documented option.
   aws route53 update-health-check \
     --health-check-id xxx \
     --disabled
   ```
6. **Verify Service** (5 min)
   ```bash
   # Health check
   curl https://api.example.com/health
   # Synthetic transaction
   ./scripts/synthetic-test.sh
   ```
### Rollback Steps
[If failover causes issues, steps to return to primary]
### Communication Template
**Internal:**
> DR failover initiated for [service] at [time UTC]. Estimated completion: [X minutes]. IC: [name]

**External (if customer-facing):**
> We are currently experiencing issues with [service]. Our team is working to restore service. Status page: [url]
---
## Phase 5: Testing
### DR Test Types
| Test Type | Frequency | Scope | Impact |
|-----------|-----------|-------|--------|
| **Tabletop** | Quarterly | Full scenario walkthrough | None |
| **Component** | Monthly | Individual component failover | Minimal |
| **Partial** | Quarterly | Non-production failover | Low |
| **Full** | Annually | Production failover | Moderate |
### DR Test Template
```markdown
## DR Test Report
**Test Date:** YYYY-MM-DD
**Test Type:** [Tabletop/Component/Partial/Full]
**Scope:** [services tested]
### Test Objectives
1. Validate RTO of <[X minutes]
2. Validate RPO of <[X minutes]
3. Verify runbook accuracy
4. Identify gaps in DR readiness
### Test Results
| Metric | Target | Actual | Status |
|--------|--------|--------|--------|
| RTO | 15 min | 12 min | PASS |
| RPO | 1 min | 45 sec | PASS |
| Data integrity | 100% | 100% | PASS |
| Runbook accuracy | 100% | 85% | PARTIAL |
### Timeline
| Time | Action | Status |
|------|--------|--------|
| 10:00 | Test initiated | OK |
| 10:02 | Primary shutdown simulated | OK |
| 10:08 | DR database promoted | OK |
| 10:12 | DR services activated | OK |
| 10:15 | Service verified | OK |
### Issues Found
| Issue | Severity | Action Required |
|-------|----------|-----------------|
| Step 4 command incorrect | Medium | Update runbook |
| DNS propagation slower | Low | Reduce TTL |
### Lessons Learned
1. [Lesson 1]
2. [Lesson 2]
### Action Items
| Item | Owner | Due Date |
|------|-------|----------|
| Update runbook step 4 | @ops | YYYY-MM-DD |
| Reduce DNS TTL | @platform | YYYY-MM-DD |
```
## Phase 6: Maintenance
### DR Maintenance Schedule
| Activity | Frequency | Owner |
|---|---|---|
| Runbook review | Quarterly | Platform team |
| DR test | Per test schedule | SRE team |
| Replication monitoring | Daily (automated) | Monitoring |
| Cost review | Monthly | FinOps |
| Architecture review | Annually | Architecture team |
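
To keep the schedule above from drifting, the next due date of each activity can be computed from the date it was last done. This is a minimal sketch under stated assumptions: `MONTHS_BETWEEN`, `add_months`, and `next_due` are hypothetical names, frequencies are interpreted as calendar months, and the "DR test" row is left out because its frequency ("Per test schedule") is defined elsewhere.

```python
import calendar
from datetime import date, timedelta

# Calendar-month intervals for the frequencies used in the schedule table.
MONTHS_BETWEEN = {"Monthly": 1, "Quarterly": 3, "Annually": 12}


def add_months(start: date, months: int) -> date:
    """Add calendar months, clamping the day to the end of the target month."""
    index = start.month - 1 + months
    year = start.year + index // 12
    month = index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def next_due(last_done: date, frequency: str) -> date:
    """Return the date an activity is next due, given its table frequency."""
    if frequency == "Daily":
        return last_done + timedelta(days=1)
    return add_months(last_done, MONTHS_BETWEEN[frequency])


def is_overdue(last_done: date, frequency: str, today: date) -> bool:
    """True when the activity should already have been repeated."""
    return today > next_due(last_done, frequency)
```

For instance, a quarterly runbook review last done on 2024-11-15 is next due on 2025-02-15, and a monthly cost review done on 2024-01-31 is due on 2024-02-29 because the day is clamped to the month's length.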
## Anti-Rationalization Table
| Rationalization | Why It's WRONG | Required Action |
|---|---|---|
| "DR can be added later" | DR added later is rarely tested | DR is a day-1 requirement |
| "Backups are good enough" | Backups != DR. Restoring from backup gives an RTO of hours, not minutes. | Design a proper DR strategy |
| "Too expensive for DR" | DR cost << outage cost | Calculate business impact |
| "We'll figure it out during incident" | Panic != good decisions | Document runbooks NOW |
| "Tested last year, still good" | Systems change constantly | Test regularly |
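
The "calculate business impact" action can start as simple arithmetic: compare yearly lost revenue when recovery relies on backups against lost revenue at the DR RTO plus the DR running cost. The sketch below is an assumption-laden illustration; `annual_cost` is a hypothetical helper, all figures in the usage note are made up, and it counts only direct revenue loss (no SLA penalties or reputational cost).

```python
def annual_cost(
    revenue_per_hour: float,
    downtime_hours_per_outage: float,
    outages_per_year: float,
    dr_cost_per_month: float = 0.0,
) -> float:
    """Expected yearly cost: lost revenue during outages plus DR running cost."""
    lost_revenue = revenue_per_hour * downtime_hours_per_outage * outages_per_year
    return lost_revenue + 12 * dr_cost_per_month


if __name__ == "__main__":
    # Hypothetical inputs: $5,000/hour revenue, one regional outage per year.
    without_dr = annual_cost(5000, downtime_hours_per_outage=8, outages_per_year=1)
    with_dr = annual_cost(5000, downtime_hours_per_outage=0.25, outages_per_year=1,
                          dr_cost_per_month=1500)
    print(f"Backups only: ${without_dr:,.0f}/year, with DR: ${with_dr:,.0f}/year")
```

With these made-up inputs, relying on an 8-hour restore costs 40,000 per year, while a 15-minute RTO with a 1,500 per month DR environment costs 19,250 per year. Real inputs should come from the BIA revenue-impact column.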
## Dispatch Specialist
For DR planning tasks, dispatch:
Task tool:
  subagent_type: "infrastructure-architect"
  model: "opus"
  prompt: |
    DR PLANNING REQUEST
    Services: [services requiring DR]
    RTO Requirement: [target]
    RPO Requirement: [target]
    Current State: [existing DR if any]
    REQUEST: [design/review/test planning]