| name | backup-disaster-recovery |
| description | Implement backup strategies, disaster recovery plans, and data restoration procedures for protecting critical infrastructure and data. |
Backup and Disaster Recovery
Overview
Design and implement comprehensive backup and disaster recovery strategies to ensure data protection, business continuity, and rapid recovery from infrastructure failures.
When to Use
- Data protection and compliance
- Business continuity planning
- Disaster recovery planning
- Point-in-time recovery (see the WAL archiving sketch after this list)
- Cross-region failover
- Data migration
- Compliance and audit requirements
- Recovery time objective (RTO) optimization
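Of these, point-in-time recovery deserves special mention because it only works if continuous WAL archiving was enabled before the incident. A minimal sketch for PostgreSQL, where the bucket path and connection settings are placeholders rather than part of the examples below:
#!/bin/bash
# Hypothetical sketch: turn on continuous WAL archiving so dumps can be combined
# with point-in-time recovery; changing archive_mode requires a server restart.
psql -h "$POSTGRES_HOST" -U "$POSTGRES_USER" -d postgres <<'SQL'
ALTER SYSTEM SET wal_level = 'replica';
ALTER SYSTEM SET archive_mode = 'on';
ALTER SYSTEM SET archive_command = 'aws s3 cp %p s3://my-backups/wal/%f';
SQL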
Implementation Examples
1. Database Backup Configuration
# postgres-backup-cronjob.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: backup-script
namespace: databases
data:
backup.sh: |
#!/bin/bash
set -euo pipefail
BACKUP_DIR="/backups/postgresql"
RETENTION_DAYS=30
DB_HOST="${POSTGRES_HOST}"
DB_PORT="${POSTGRES_PORT:-5432}"
DB_USER="${POSTGRES_USER}"
DB_PASSWORD="${POSTGRES_PASSWORD}"
export PGPASSWORD="$DB_PASSWORD"
# Create backup directory
mkdir -p "$BACKUP_DIR"
# Full backup
BACKUP_FILE="$BACKUP_DIR/full-$(date +%Y%m%d-%H%M%S).sql"
echo "Starting backup to $BACKUP_FILE"
pg_dump -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" -v \
--format=plain --no-owner --no-privileges > "$BACKUP_FILE"
# Compress backup
gzip "$BACKUP_FILE"
echo "Backup compressed: ${BACKUP_FILE}.gz"
# Upload to S3
aws s3 cp "${BACKUP_FILE}.gz" \
"s3://my-backups/postgres/$(date +%Y/%m/%d)/"
# Remove local copies older than the retention window (long-term copies live in S3)
find "$BACKUP_DIR" -type f -mtime +"$RETENTION_DAYS" -delete
# Verify the dump restores cleanly into a scratch database
# (plain-format dumps are restored with psql; pg_restore only handles custom/tar formats)
createdb -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" test_restore
if gunzip -c "${BACKUP_FILE}.gz" | \
psql -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" -q --single-transaction test_restore > /dev/null; then
echo "Backup verification successful"
else
echo "Backup verification failed" >&2
fi
dropdb -h "$DB_HOST" -p "$DB_PORT" -U "$DB_USER" test_restore
echo "Backup complete: ${BACKUP_FILE}.gz"
---
apiVersion: batch/v1
kind: CronJob
metadata:
name: postgres-backup
namespace: databases
spec:
schedule: "0 2 * * *" # 2 AM daily
successfulJobsHistoryLimit: 3
failedJobsHistoryLimit: 3
jobTemplate:
spec:
template:
spec:
serviceAccountName: backup-sa
containers:
- name: backup
image: postgres:15-alpine
env:
- name: POSTGRES_HOST
valueFrom:
secretKeyRef:
name: postgres-credentials
key: host
- name: POSTGRES_USER
valueFrom:
secretKeyRef:
name: postgres-credentials
key: username
- name: POSTGRES_PASSWORD
valueFrom:
secretKeyRef:
name: postgres-credentials
key: password
- name: AWS_ACCESS_KEY_ID
valueFrom:
secretKeyRef:
name: aws-credentials
key: access-key
- name: AWS_SECRET_ACCESS_KEY
valueFrom:
secretKeyRef:
name: aws-credentials
key: secret-key
volumeMounts:
- name: backup-script
mountPath: /backup
- name: backup-storage
mountPath: /backups
command:
- sh
- -c
- apk add --no-cache aws-cli && bash /backup/backup.sh
volumes:
- name: backup-script
configMap:
name: backup-script
defaultMode: 0755
- name: backup-storage
emptyDir:
sizeLimit: 100Gi
restartPolicy: OnFailure
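To exercise the backup path without waiting for the 2 AM schedule, a one-off Job can be created from the CronJob (the job name below is arbitrary):
# Trigger a manual backup run and follow its logs
kubectl create job --from=cronjob/postgres-backup manual-backup-test -n databases
kubectl logs -n databases job/manual-backup-test -f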
2. Disaster Recovery Plan Template
# disaster-recovery-plan.yaml
---
apiVersion: v1
kind: ConfigMap
metadata:
name: dr-procedures
namespace: operations
data:
dr-runbook.md: |
# Disaster Recovery Runbook
## RTO and RPO Targets
- RTO (Recovery Time Objective): 4 hours
- RPO (Recovery Point Objective): 1 hour
## Pre-Disaster Checklist
- [ ] Verify backups are current
- [ ] Test backup restoration process
- [ ] Verify DR site resources are provisioned
- [ ] Confirm failover DNS is configured
## Primary Region Failure
### Detection (0-15 minutes)
- Alerting detects that the primary region is unavailable
- Incident commander is designated
- War room opened in Slack #incidents
### Initial Actions (15-30 minutes)
- Verify primary region is truly down
- Check backup systems in secondary region
- Validate latest backup timestamp
### Failover Procedure (30 minutes - 2 hours)
1. Validate backup integrity
2. Restore database from latest backup
3. Update application configuration
4. Perform DNS failover to secondary region
5. Verify application health
### Recovery Steps
1. Restore from backup: `restore-backup.sh --backup-id=latest`
2. Update DNS: `aws route53 change-resource-record-sets --cli-input-json file://failover.json`
3. Verify: `curl https://myapp.com/health`
4. Run smoke tests
5. Monitor error rates and performance
## Post-Disaster
- Document timeline and RCA
- Update runbooks
- Schedule post-mortem
- Test backups again
---
apiVersion: v1
kind: Secret
metadata:
name: dr-credentials
namespace: operations
type: Opaque
stringData:
# Placeholder values only - source real credentials from a secrets manager rather than committing them to manifests
backup_aws_access_key: "AKIAIOSFODNN7EXAMPLE"
backup_aws_secret_key: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
dr_site_password: "secure-password-here"
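The 1-hour RPO above is only credible if someone notices when backups stop arriving. A minimal freshness check, assuming the my-backups bucket and postgres/ prefix from example 1 and GNU date:
#!/bin/bash
# Alert if the newest PostgreSQL backup in S3 is older than the 1-hour RPO target
set -euo pipefail
LATEST=$(aws s3 ls s3://my-backups/postgres/ --recursive | sort | tail -n 1 | awk '{print $1" "$2}')
AGE_SECONDS=$(( $(date +%s) - $(date -d "$LATEST" +%s) ))
if [ "$AGE_SECONDS" -gt 3600 ]; then
echo "RPO breach: newest backup is ${AGE_SECONDS}s old" >&2
exit 1
fi
echo "Backups are current (newest is ${AGE_SECONDS}s old)"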
3. Backup and Restore Script
#!/bin/bash
# backup-restore.sh - Complete backup and restore utilities
set -euo pipefail
BACKUP_BUCKET="s3://my-backups"
BACKUP_RETENTION_DAYS=30
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
# Colors
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m'
log_info() {
echo -e "${GREEN}[INFO]${NC} $1"
}
log_error() {
echo -e "${RED}[ERROR]${NC} $1"
}
log_warn() {
echo -e "${YELLOW}[WARN]${NC} $1"
}
# Backup function
backup_all() {
local environment=$1
log_info "Starting backup for $environment environment"
# Backup databases
log_info "Backing up databases..."
for db in myapp_db analytics_db; do
local backup_file="$BACKUP_BUCKET/$environment/databases/${db}-${TIMESTAMP}.sql.gz"
pg_dump "$db" | gzip | aws s3 cp - "$backup_file"
log_info "Backed up $db to $backup_file"
done
# Backup Kubernetes resources
log_info "Backing up Kubernetes resources..."
kubectl get all,configmap,secret,ingress,pvc -A -o yaml | \
gzip | aws s3 cp - "$BACKUP_BUCKET/$environment/kubernetes-${TIMESTAMP}.yaml.gz"
log_info "Kubernetes resources backed up"
# Backup volumes (assumes a helper pod named backup-pod that mounts each PVC at /data)
log_info "Backing up persistent volumes..."
for pvc in $(kubectl get pvc -A -o name); do
local pvc_name=$(echo "$pvc" | cut -d'/' -f2)
log_info "Backing up PVC: $pvc_name"
kubectl exec -n default backup-pod -- \
tar czf - /data | aws s3 cp - "$BACKUP_BUCKET/$environment/volumes/${pvc_name}-${TIMESTAMP}.tar.gz"
done
log_info "All backups completed successfully"
}
# Restore function
restore_all() {
local environment=$1
local backup_date=$2
log_warn "Restoring from backup date: $backup_date"
read -p "Are you sure? (yes/no): " confirm
if [ "$confirm" != "yes" ]; then
log_error "Restore cancelled"
exit 1
fi
# Restore databases
log_info "Restoring databases..."
for db in myapp_db analytics_db; do
local backup_file="$BACKUP_BUCKET/$environment/databases/${db}-${backup_date}.sql.gz"
log_info "Restoring $db from $backup_file"
aws s3 cp "$backup_file" - | gunzip | psql "$db"
done
# Restore Kubernetes resources
log_info "Restoring Kubernetes resources..."
local k8s_backup="$BACKUP_BUCKET/$environment/kubernetes-${backup_date}.yaml.gz"
aws s3 cp "$k8s_backup" - | gunzip | kubectl apply -f -
log_info "Restore completed successfully"
}
# Test restore
test_restore() {
local environment=$1
log_info "Testing restore procedure..."
# Get latest backup
local latest_backup=$(aws s3 ls "$BACKUP_BUCKET/$environment/databases/" | \
sort | tail -n 1 | awk '{print $4}')
if [ -z "$latest_backup" ]; then
log_error "No backups found"
exit 1
fi
log_info "Testing restore from: $latest_backup"
# Create a scratch database (reuse one timestamp so create, restore, and drop target the same DB)
local test_db="test_restore_$(date +%s)"
psql -c "CREATE DATABASE ${test_db};"
# Download, decompress, and restore into the scratch database
aws s3 cp "$BACKUP_BUCKET/$environment/databases/$latest_backup" - | \
gunzip | psql "$test_db"
# Drop the scratch database once the restore succeeds
psql -c "DROP DATABASE ${test_db};"
log_info "Test restore successful"
}
# List backups
list_backups() {
local environment=$1
log_info "Available backups for $environment:"
aws s3 ls "$BACKUP_BUCKET/$environment/" --recursive | grep -E "\.sql\.gz|\.yaml\.gz|\.tar\.gz"
}
# Cleanup old backups (S3 keys cannot be pruned with find; prefer an S3 lifecycle rule, or list and delete)
cleanup_old_backups() {
local environment=$1
local cutoff=$(date -d "-${BACKUP_RETENTION_DAYS} days" +%Y-%m-%d)   # GNU date
log_info "Cleaning up backups older than $BACKUP_RETENTION_DAYS days (before $cutoff)"
aws s3 ls "$BACKUP_BUCKET/$environment/" --recursive | while read -r obj_date _ _ key; do
if [ -n "$key" ] && [[ "$obj_date" < "$cutoff" ]]; then
aws s3 rm "$BACKUP_BUCKET/$key"
fi
done
log_info "Cleanup completed"
}
# Main
main() {
case "${1:-}" in
backup)
backup_all "${2:-production}"
;;
restore)
restore_all "${2:-production}" "${3:-}"
;;
test)
test_restore "${2:-production}"
;;
list)
list_backups "${2:-production}"
;;
cleanup)
cleanup_old_backups "${2:-production}"
;;
*)
echo "Usage: $0 {backup|restore|test|list|cleanup} [environment] [backup-date]"
exit 1
;;
esac
}
main "$@"
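Typical invocations (environment names and the backup timestamp are illustrative):
# Run a full production backup, then confirm the objects landed in S3
./backup-restore.sh backup production
./backup-restore.sh list production
# Rehearse a restore into staging from a specific backup timestamp
./backup-restore.sh restore staging 20240115_020000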
4. Cross-Region Failover
# route53-failover.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: failover-config
namespace: operations
data:
failover.sh: |
#!/bin/bash
set -euo pipefail
PRIMARY_REGION="us-east-1"
SECONDARY_REGION="us-west-2"
DOMAIN="myapp.com"
HOSTED_ZONE_ID="Z1234567890ABC"
echo "Initiating failover to $SECONDARY_REGION"
# Get primary endpoint and its load balancer's canonical hosted zone
PRIMARY_ENDPOINT=$(aws elbv2 describe-load-balancers \
--region "$PRIMARY_REGION" \
--query 'LoadBalancers[0].DNSName' \
--output text)
PRIMARY_LB_ZONE=$(aws elbv2 describe-load-balancers \
--region "$PRIMARY_REGION" \
--query 'LoadBalancers[0].CanonicalHostedZoneId' \
--output text)
# Get secondary endpoint and its load balancer's canonical hosted zone
SECONDARY_ENDPOINT=$(aws elbv2 describe-load-balancers \
--region "$SECONDARY_REGION" \
--query 'LoadBalancers[0].DNSName' \
--output text)
SECONDARY_LB_ZONE=$(aws elbv2 describe-load-balancers \
--region "$SECONDARY_REGION" \
--query 'LoadBalancers[0].CanonicalHostedZoneId' \
--output text)
# Upsert the failover record set. Alias records must not set a TTL, and each
# load balancer's own canonical hosted zone ID is used (it differs per region).
aws route53 change-resource-record-sets \
--hosted-zone-id "$HOSTED_ZONE_ID" \
--change-batch '{
"Changes": [
{
"Action": "UPSERT",
"ResourceRecordSet": {
"Name": "'$DOMAIN'",
"Type": "A",
"SetIdentifier": "Primary",
"Failover": "PRIMARY",
"AliasTarget": {
"HostedZoneId": "'$PRIMARY_LB_ZONE'",
"DNSName": "'$PRIMARY_ENDPOINT'",
"EvaluateTargetHealth": true
}
}
},
{
"Action": "UPSERT",
"ResourceRecordSet": {
"Name": "'$DOMAIN'",
"Type": "A",
"SetIdentifier": "Secondary",
"Failover": "SECONDARY",
"AliasTarget": {
"HostedZoneId": "'$SECONDARY_LB_ZONE'",
"DNSName": "'$SECONDARY_ENDPOINT'",
"EvaluateTargetHealth": false
}
}
}
]
}'
echo "Failover completed"
Best Practices
✅ DO
- Perform regular backup testing
- Use multiple backup locations
- Implement automated backups
- Document recovery procedures
- Test failover procedures regularly
- Monitor backup completion
- Use immutable backups (see the bucket-hardening sketch after these lists)
- Encrypt backups at rest and in transit
❌ DON'T
- Rely on a single backup location
- Ignore backup failures
- Store backups in the same location or account as production data
- Skip testing recovery procedures
- Compress backups so aggressively that restores can no longer meet the RTO
- Forget to verify backup integrity
- Store encryption keys with backups
- Assume backups are automatically working
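A sketch of the immutability and encryption items from the DO list, assuming the my-backups bucket used throughout (note that S3 Object Lock can only be used on buckets created with it enabled):
# Enable versioning and default KMS encryption on the backup bucket
aws s3api put-bucket-versioning --bucket my-backups \
--versioning-configuration Status=Enabled
aws s3api put-bucket-encryption --bucket my-backups \
--server-side-encryption-configuration '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"aws:kms"}}]}'
# Apply a default 30-day compliance-mode retention to new objects
aws s3api put-object-lock-configuration --bucket my-backups \
--object-lock-configuration '{"ObjectLockEnabled":"Enabled","Rule":{"DefaultRetention":{"Mode":"COMPLIANCE","Days":30}}}'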