| name | backup-etcd |
| description | Create and manage etcd backups for OpenShift control plane disaster recovery, including automated backup procedures and verification. |
Backup etcd
This skill provides comprehensive guidance for backing up etcd, the critical key-value store for OpenShift/Kubernetes control plane data. Proper etcd backups are essential for cluster disaster recovery.
When to Use This Skill
- Before Major Changes: Prior to cluster upgrades, configuration changes, or major updates
- Regular Maintenance: As part of routine backup procedures
- Disaster Recovery Planning: Setting up automated etcd backup schedules
- Pre-Migration: Before cluster migrations or infrastructure changes
- Compliance: Meeting backup retention requirements
- Emergency Situations: When cluster issues are detected, capture the current state before attempting recovery
What This Skill Does
- Validates Cluster Health: Checks etcd cluster status before backup
- Creates etcd Snapshot: Generates consistent etcd data snapshot
- Stores Backup Safely: Copies snapshot to secure off-cluster location
- Verifies Backup: Validates snapshot integrity and completeness
- Manages Retention: Implements backup rotation and cleanup
- Documents Metadata: Records backup metadata for recovery procedures
How to Use
Basic Usage
"Create an etcd backup"
"Backup etcd and store in S3"
Automated Usage
"Set up automated daily etcd backups"
"Create etcd backup CronJob with retention policy"
Emergency Usage
"Create emergency etcd snapshot before cluster recovery"
Prerequisites
Before backing up etcd:
- Cluster admin access with oc CLI configured
- At least one healthy master node accessible
- Sufficient disk space on master nodes (typically 2-3x etcd data size); see the pre-flight check after this list
- SSH access to master nodes (for manual backups)
- Off-cluster storage location configured (for backup copies)
- Understanding of etcd cluster topology
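A minimal pre-flight sketch that checks several of these prerequisites in one pass, assuming `oc` is already logged in with cluster-admin rights:

```bash
#!/bin/bash
# Pre-flight checks before an etcd backup (sketch; assumes a cluster-admin oc login)
set -euo pipefail

# Confirm master nodes are reachable and listed
oc get nodes -l node-role.kubernetes.io/master -o name

# Confirm etcd pods are running
oc get pods -n openshift-etcd -l app=etcd

# Check free space where backups will land (/home/core on the first master)
MASTER_NODE=$(oc get nodes -l node-role.kubernetes.io/master -o jsonpath='{.items[0].metadata.name}')
oc debug node/$MASTER_NODE -- chroot /host df -h /home/core
```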
Examples
Example 1: Manual etcd Backup on Single Master
User: "Create manual etcd backup"
Skill Actions:
Check etcd cluster health:
# Verify all etcd members are healthy
oc get etcd -o=jsonpath='{range .items[0].status.conditions[?(@.type=="EtcdMembersAvailable")]}{.message}{"\n"}{.reason}{"\n"}{.status}{"\n"}{end}'

# Check etcd pods
oc get pods -n openshift-etcd -l app=etcd

# Verify etcd endpoint health
ETCD_POD=$(oc get pods -n openshift-etcd -l app=etcd -o jsonpath='{.items[0].metadata.name}')
oc exec -n openshift-etcd $ETCD_POD -- etcdctl endpoint health -w table

Access master node and create backup:
# Get first master node name
MASTER_NODE=$(oc get nodes -l node-role.kubernetes.io/master -o jsonpath='{.items[0].metadata.name}')

# Start debug session on master node
oc debug node/$MASTER_NODE

# Once in the debug pod, run the backup
chroot /host

# Create backup directory
mkdir -p /home/core/etcd-backups

# Run cluster backup script
/usr/local/bin/cluster-backup.sh /home/core/etcd-backups

# Exit chroot and debug pod
exit
exit

Verify backup created:
# List backups on master node
oc debug node/$MASTER_NODE -- chroot /host ls -lh /home/core/etcd-backups/

Copy backup off cluster:
# Create local directory
mkdir -p ./etcd-backups/$(date +%Y%m%d)

# Copy backup files
oc debug node/$MASTER_NODE -- chroot /host tar czf - /home/core/etcd-backups/ | tar xzf - -C ./etcd-backups/$(date +%Y%m%d)/

# Verify files copied
ls -lh ./etcd-backups/$(date +%Y%m%d)/
Expected Output:
Backup created at: /home/core/etcd-backups/snapshot_2025-11-17_152030.db
Static pod resources backed up to: /home/core/etcd-backups/static_kuberesources_2025-11-17_152030.tar.gz
Success Indicators:
- ✅ snapshot_*.db file created (etcd data)
- ✅ static_kuberesources_*.tar.gz created (static pod manifests)
- ✅ No errors in backup script output
- ✅ Backup files copied off cluster successfully (checksum sketch below)
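One way to back the last indicator with evidence is to record checksums when the files are copied and re-verify them later; a sketch, assuming the copy layout from the step above (the extracted path may differ slightly depending on your tar version):

```bash
# Record checksums of the copied backup files (sketch)
cd ./etcd-backups/$(date +%Y%m%d)
sha256sum home/core/etcd-backups/* > SHA256SUMS

# Re-verify at any later point
sha256sum -c SHA256SUMS
```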
Example 2: Automated etcd Backup with CronJob
User: "Set up automated daily etcd backups at 2 AM"
Skill Actions:
Create etcd backup script ConfigMap:
apiVersion: v1
kind: ConfigMap
metadata:
  name: etcd-backup-script
  namespace: openshift-etcd
data:
  backup-etcd.sh: |
    #!/bin/bash
    set -euo pipefail

    BACKUP_DIR="/host/home/core/etcd-backups"
    BACKUP_DATE=$(date +%Y%m%d-%H%M%S)
    RETENTION_DAYS=7

    echo "Starting etcd backup at $BACKUP_DATE"

    # Create backup directory
    mkdir -p $BACKUP_DIR

    # Run backup on the host filesystem (cluster-backup.sh lives on the host, not in the container)
    chroot /host /usr/local/bin/cluster-backup.sh /home/core/etcd-backups

    echo "Backup completed successfully"

    # Clean up old backups
    find $BACKUP_DIR -name "snapshot_*.db" -mtime +$RETENTION_DAYS -delete
    find $BACKUP_DIR -name "static_kuberesources_*.tar.gz" -mtime +$RETENTION_DAYS -delete

    echo "Old backups cleaned up (retention: $RETENTION_DAYS days)"

    # List current backups
    echo "Current backups:"
    ls -lh $BACKUP_DIR/

Create ServiceAccount with required permissions:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: etcd-backup
  namespace: openshift-etcd
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: etcd-backup
rules:
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "list"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list"]
- apiGroups: [""]
  resources: ["pods/log"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: etcd-backup
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: etcd-backup
subjects:
- kind: ServiceAccount
  name: etcd-backup
  namespace: openshift-etcd

Create CronJob for automated backups:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: openshift-etcd
spec:
  schedule: "0 2 * * *"  # 2 AM daily
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    metadata:
      labels:
        app: etcd-backup  # label the Jobs themselves so `oc get jobs -l app=etcd-backup` finds them
    spec:
      backoffLimit: 0
      template:
        metadata:
          labels:
            app: etcd-backup
        spec:
          serviceAccountName: etcd-backup
          hostNetwork: true
          hostPID: true
          nodeSelector:
            node-role.kubernetes.io/master: ""
          tolerations:
          - operator: "Exists"
          containers:
          - name: etcd-backup
            image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:...  # Use cluster version
            command:
            - /bin/bash
            - -c
            - |
              #!/bin/bash
              set -euo pipefail

              # Run backup script
              /scripts/backup-etcd.sh

              # Optional: Upload to S3/cloud storage
              # aws s3 cp /host/home/core/etcd-backups/ s3://my-backups/etcd/ --recursive
            volumeMounts:
            - name: host
              mountPath: /host
            - name: backup-script
              mountPath: /scripts
            securityContext:
              privileged: true
              runAsUser: 0
          volumes:
          - name: host
            hostPath:
              path: /
              type: Directory
          - name: backup-script
            configMap:
              name: etcd-backup-script
              defaultMode: 0755
          restartPolicy: Never

Apply resources:
oc apply -f etcd-backup-configmap.yaml
oc apply -f etcd-backup-rbac.yaml
oc apply -f etcd-backup-cronjob.yaml

# Grant the ServiceAccount the privileged SCC so the backup pod is admitted
oc adm policy add-scc-to-user privileged -z etcd-backup -n openshift-etcd

Test CronJob manually:
# Create manual job from CronJob
oc create job --from=cronjob/etcd-backup etcd-backup-manual -n openshift-etcd

# Watch job completion
oc get jobs -n openshift-etcd -w

# Check logs
oc logs -n openshift-etcd job/etcd-backup-manual
Success Indicators:
- ✅ CronJob created and scheduled (confirmation command below)
- ✅ Manual test job completes successfully
- ✅ Backup files appear in /home/core/etcd-backups on master
- ✅ Old backups cleaned up based on retention policy
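To confirm the schedule is active after the manual test, the CronJob status records when it last fired:

```bash
# Show the schedule, suspension state, and last trigger time
oc get cronjob etcd-backup -n openshift-etcd \
  -o custom-columns=SCHEDULE:.spec.schedule,SUSPEND:.spec.suspend,LAST-SCHEDULE:.status.lastScheduleTime
```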
Example 3: etcd Backup with S3 Upload
User: "Create etcd backup and upload to S3"
Skill Actions:
Create AWS credentials secret:
cat <<EOF > aws-credentials
[default]
aws_access_key_id=<ACCESS_KEY>
aws_secret_access_key=<SECRET_KEY>
EOF

oc create secret generic etcd-backup-aws-creds \
  --from-file=credentials=aws-credentials \
  -n openshift-etcd

rm aws-credentials

Create enhanced backup script with S3 upload:
apiVersion: v1
kind: ConfigMap
metadata:
  name: etcd-backup-script-s3
  namespace: openshift-etcd
data:
  backup-etcd.sh: |
    #!/bin/bash
    # Note: requires the aws CLI, oc, and jq to be available in the backup image
    set -euo pipefail

    BACKUP_DIR="/host/home/core/etcd-backups"
    BACKUP_DATE=$(date +%Y%m%d-%H%M%S)
    S3_BUCKET="s3://my-etcd-backups"
    S3_PREFIX="openshift-cluster-1"
    RETENTION_DAYS=30

    echo "Starting etcd backup at $BACKUP_DATE"

    # Create backup directory
    mkdir -p $BACKUP_DIR

    # Run backup on the host filesystem (cluster-backup.sh lives on the host)
    chroot /host /usr/local/bin/cluster-backup.sh /home/core/etcd-backups

    # Get backup files
    SNAPSHOT_FILE=$(ls -t $BACKUP_DIR/snapshot_*.db | head -1)
    STATIC_RESOURCES=$(ls -t $BACKUP_DIR/static_kuberesources_*.tar.gz | head -1)

    echo "Snapshot: $SNAPSHOT_FILE"
    echo "Static resources: $STATIC_RESOURCES"

    # Upload to S3
    echo "Uploading to S3..."
    aws s3 cp $SNAPSHOT_FILE $S3_BUCKET/$S3_PREFIX/$(basename $SNAPSHOT_FILE)
    aws s3 cp $STATIC_RESOURCES $S3_BUCKET/$S3_PREFIX/$(basename $STATIC_RESOURCES)

    # Create metadata file
    ETCD_POD=$(oc get pods -n openshift-etcd -l app=etcd -o jsonpath='{.items[0].metadata.name}')
    cat <<METADATA > $BACKUP_DIR/backup-metadata-$BACKUP_DATE.json
    {
      "timestamp": "$BACKUP_DATE",
      "snapshot_file": "$(basename $SNAPSHOT_FILE)",
      "static_resources": "$(basename $STATIC_RESOURCES)",
      "cluster_version": "$(oc version -o json | jq -r '.openshiftVersion')",
      "etcd_version": "$(oc exec -n openshift-etcd $ETCD_POD -- etcdctl version | head -1)"
    }
    METADATA

    # Upload metadata
    aws s3 cp $BACKUP_DIR/backup-metadata-$BACKUP_DATE.json \
      $S3_BUCKET/$S3_PREFIX/backup-metadata-$BACKUP_DATE.json

    echo "Upload completed successfully"

    # List S3 backups
    echo "Backups in S3:"
    aws s3 ls $S3_BUCKET/$S3_PREFIX/

    # Clean up local old backups
    find $BACKUP_DIR -name "snapshot_*.db" -mtime +7 -delete
    find $BACKUP_DIR -name "static_kuberesources_*.tar.gz" -mtime +7 -delete
    find $BACKUP_DIR -name "backup-metadata-*.json" -mtime +7 -delete

    echo "Backup process completed"

Update CronJob with S3 credentials:
# Add to the CronJob container spec:
env:
- name: AWS_SHARED_CREDENTIALS_FILE
  value: /aws/credentials
- name: AWS_DEFAULT_REGION
  value: us-east-1
volumeMounts:
- name: aws-credentials
  mountPath: /aws
  readOnly: true

# Add to volumes:
- name: aws-credentials
  secret:
    secretName: etcd-backup-aws-creds
Verification:
# Check S3 bucket
aws s3 ls s3://my-etcd-backups/openshift-cluster-1/
# Download and verify the most recent snapshot (filenames are timestamped)
aws s3 cp s3://my-etcd-backups/openshift-cluster-1/snapshot_2025-11-17_152030.db ./
file snapshot_2025-11-17_152030.db # Should report: data
Example 4: Pre-Upgrade etcd Backup
User: "Create etcd backup before cluster upgrade"
Skill Actions:
Verify cluster health before backup:
# Check cluster operators
oc get co

# Check the etcd members condition
oc get etcd cluster -o json | jq '.status.conditions[] | select(.type=="EtcdMembersAvailable")'

# Verify no etcd leader elections happening
oc logs -n openshift-etcd -l app=etcd --tail=100 | grep -i "leader"

Create labeled backup:
UPGRADE_VERSION="4.14.8"
BACKUP_DATE=$(date +%Y%m%d-%H%M%S)
BACKUP_LABEL="pre-upgrade-$UPGRADE_VERSION"

# Access master and create backup
MASTER_NODE=$(oc get nodes -l node-role.kubernetes.io/master -o jsonpath='{.items[0].metadata.name}')

# Unquoted heredoc so $BACKUP_LABEL expands locally before the commands reach the debug shell
oc debug node/$MASTER_NODE <<BACKUP_EOF
chroot /host
mkdir -p /home/core/etcd-backups/$BACKUP_LABEL
/usr/local/bin/cluster-backup.sh /home/core/etcd-backups/$BACKUP_LABEL
exit
exit
BACKUP_EOF

Document backup metadata:
# Create backup documentation
cat <<EOF > etcd-backup-metadata-$BACKUP_DATE.yaml
backup_type: pre-upgrade
backup_date: $BACKUP_DATE
cluster_version_before: $(oc version -o json | jq -r '.openshiftVersion')
target_upgrade_version: $UPGRADE_VERSION
cluster_name: $(oc get infrastructure cluster -o jsonpath='{.status.infrastructureName}')
etcd_members: $(oc get pods -n openshift-etcd -l app=etcd -o jsonpath='{.items[*].metadata.name}')
backup_location: /home/core/etcd-backups/$BACKUP_LABEL
notes: "Pre-upgrade backup before upgrading to $UPGRADE_VERSION"
EOF

# Store metadata
oc create configmap etcd-backup-metadata-$BACKUP_DATE \
  --from-file=metadata=etcd-backup-metadata-$BACKUP_DATE.yaml \
  -n openshift-etcd

Copy and archive:
# Copy off cluster
mkdir -p ./critical-backups/pre-upgrade-$UPGRADE_VERSION
oc debug node/$MASTER_NODE -- chroot /host tar czf - /home/core/etcd-backups/$BACKUP_LABEL | \
  tar xzf - -C ./critical-backups/pre-upgrade-$UPGRADE_VERSION/

# Create archive
tar czf etcd-pre-upgrade-$UPGRADE_VERSION-$BACKUP_DATE.tar.gz \
  -C ./critical-backups pre-upgrade-$UPGRADE_VERSION

echo "Backup archived: etcd-pre-upgrade-$UPGRADE_VERSION-$BACKUP_DATE.tar.gz"
Post-Backup Verification:
# Verify backup file integrity
tar tzf etcd-pre-upgrade-$UPGRADE_VERSION-$BACKUP_DATE.tar.gz | head -20
# Check backup size (typically several GB for a production cluster)
ls -lh etcd-pre-upgrade-$UPGRADE_VERSION-$BACKUP_DATE.tar.gz
Monitoring and Verification
Check Backup Health
# Verify etcd snapshot integrity (basic check)
SNAPSHOT_FILE=$(ls -t /home/core/etcd-backups/snapshot_*.db | head -1)
file $SNAPSHOT_FILE # Should report: data

# Check snapshot size (production clusters typically 1-10GB)
ls -lh $SNAPSHOT_FILE
# Verify static resources archive
tar tzf /home/core/etcd-backups/static_kuberesources_*.tar.gz | head -20
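For a stronger check than `file`, the etcd tooling can read the snapshot's hash, revision, and key count. A sketch, assuming `etcdutl` is installed wherever the snapshot file lives (older etcd releases use `etcdctl snapshot status` instead); the filename is illustrative:

```bash
# Inspect snapshot metadata; an error here usually means a corrupt or truncated file
etcdutl snapshot status /home/core/etcd-backups/snapshot_2025-11-17_152030.db -w table
```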
Monitor Automated Backups
# Check CronJob status
oc get cronjobs -n openshift-etcd
# View recent job runs
oc get jobs -n openshift-etcd -l app=etcd-backup --sort-by=.metadata.creationTimestamp
# Check latest backup logs
LATEST_JOB=$(oc get jobs -n openshift-etcd -l app=etcd-backup --sort-by=.metadata.creationTimestamp -o jsonpath='{.items[-1].metadata.name}')
oc logs -n openshift-etcd job/$LATEST_JOB
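Beyond reading logs, it is worth failing loudly when no backup has completed recently; a minimal staleness check, assuming the Job labels from Example 2 and GNU `date`:

```bash
#!/bin/bash
# Warn if the newest backup job finished more than 26 hours ago (sketch)
LAST=$(oc get jobs -n openshift-etcd -l app=etcd-backup \
  --sort-by=.status.completionTime \
  -o jsonpath='{.items[-1].status.completionTime}')
if [ -z "$LAST" ]; then
  echo "WARNING: no completed etcd backup jobs found"
  exit 1
fi
NOW=$(date +%s)
THEN=$(date -d "$LAST" +%s)
if (( NOW - THEN > 26 * 3600 )); then
  echo "WARNING: last completed etcd backup was at $LAST"
fi
```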
Troubleshooting
Backup Script Fails
Symptoms: cluster-backup.sh fails with errors
Common Causes:
Insufficient disk space:
# Check disk space on master
oc debug node/$MASTER_NODE -- chroot /host df -h /home/core

etcd unhealthy:
# Check etcd member health
oc exec -n openshift-etcd $ETCD_POD -- etcdctl endpoint health -w table

Permission issues:
# Verify running as root in debug pod
oc debug node/$MASTER_NODE -- chroot /host whoami
CronJob Not Running
Symptoms: No backup jobs created
Diagnosis:
# Check CronJob status
oc describe cronjob etcd-backup -n openshift-etcd
# Check for scheduling errors
oc get events -n openshift-etcd --field-selector involvedObject.name=etcd-backup
Common Issues:
- Node selector not matching (no master nodes)
- ServiceAccount permissions insufficient
- Image pull errors
Cannot Access Master Node
Symptoms: oc debug fails to access master
Solutions:
# Try different master node
oc get nodes -l node-role.kubernetes.io/master
oc debug node/<different-master>
# Check SSH access (alternative method)
ssh core@<master-ip>
sudo /usr/local/bin/cluster-backup.sh /home/core/etcd-backups
Best Practices
Regular Schedule
- Daily backups minimum for production
- Before/after major changes
- Before cluster upgrades
- Store at least 7 days of backups
Off-Cluster Storage
- Always copy backups off cluster immediately
- Use different storage system than cluster
- Encrypt backups at rest and in transit (see the GPG sketch after this list)
- Test backup accessibility regularly
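For the encryption practice above, one simple option is symmetric GPG before upload; a sketch with an illustrative filename (passphrase management is up to you):

```bash
# Encrypt a snapshot before it leaves your control (prompts for a passphrase)
gpg --symmetric --cipher-algo AES256 snapshot_2025-11-17_152030.db
# Produces snapshot_2025-11-17_152030.db.gpg; upload that instead of the plaintext file

# Decrypt during recovery
gpg --decrypt --output snapshot_2025-11-17_152030.db snapshot_2025-11-17_152030.db.gpg
```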
Retention Policy
- Keep daily backups for 7 days (see the S3 lifecycle sketch after this list)
- Weekly backups for 4 weeks
- Monthly backups for 6-12 months
- Pre-upgrade backups indefinitely until verified
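If backups live in S3 as in Example 3, the expiration part of this policy can be enforced server-side; a sketch using the bucket and prefix assumed earlier (the daily/weekly/monthly tiering itself still needs separate prefixes or a rotation script):

```bash
# Expire objects under the cluster prefix after 35 days (sketch)
cat <<'EOF' > lifecycle.json
{
  "Rules": [
    {
      "ID": "expire-old-etcd-backups",
      "Status": "Enabled",
      "Filter": { "Prefix": "openshift-cluster-1/" },
      "Expiration": { "Days": 35 }
    }
  ]
}
EOF
aws s3api put-bucket-lifecycle-configuration \
  --bucket my-etcd-backups --lifecycle-configuration file://lifecycle.json
```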
Backup Validation
- Test restore procedure quarterly (offline validation sketch below)
- Verify backup file integrity
- Document restore procedures
- Train team on recovery process
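A snapshot can be validated offline, without touching the cluster, by restoring it into a scratch data directory; a sketch, again assuming `etcdutl` (older releases: `etcdctl snapshot restore`):

```bash
# Restore into a throwaway directory; failure indicates a bad snapshot (sketch)
etcdutl snapshot restore ./snapshot_2025-11-17_152030.db --data-dir /tmp/etcd-restore-test

# The restored member directory should exist and be non-trivial in size
ls -lh /tmp/etcd-restore-test/member
rm -rf /tmp/etcd-restore-test
```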
Monitoring
- Alert on backup failures (see the alert rule sketch after this list)
- Monitor backup size trends
- Track backup completion times
- Alert on missing scheduled backups
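On OpenShift, failed backup Jobs can drive an alert through the platform monitoring stack; a sketch of a PrometheusRule, assuming the Job labels from Example 2 and that alerting is enabled for the namespace:

```bash
# Alert when an etcd-backup Job reports failure (sketch)
cat <<'EOF' | oc apply -f -
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: etcd-backup-alerts
  namespace: openshift-etcd
spec:
  groups:
  - name: etcd-backup
    rules:
    - alert: EtcdBackupJobFailed
      # kube-state-metrics exposes per-Job failure counts as kube_job_status_failed
      expr: kube_job_status_failed{namespace="openshift-etcd",job_name=~"etcd-backup.*"} > 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "etcd backup job {{ $labels.job_name }} has failed"
EOF
```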
Documentation
- Record backup metadata (date, version, reason)
- Document recovery procedures
- Maintain inventory of backups
- Note any special considerations
Disaster Recovery Integration
Combine with OADP
While etcd backups are separate from OADP, combine both for complete DR:
# 1. Create etcd backup
/usr/local/bin/cluster-backup.sh /home/core/etcd-backups
# 2. Create OADP namespace backup
velero backup create full-cluster-backup \
--include-namespaces "*" \
--exclude-namespaces "openshift-*,kube-*" \
--snapshot-volumes
# 3. Store both backups together
tar czf complete-cluster-backup-$(date +%Y%m%d).tar.gz \
etcd-backups/ \
oadp-backup-metadata/
Recovery Priority
In disaster recovery:
- Restore etcd first (control plane)
- Verify cluster comes up (health checks below)
- Restore OADP backups (applications)
- Verify application functionality
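A quick verification pass for the "cluster comes up" step, reusing the health checks from Example 1:

```bash
# After restoring etcd, confirm the control plane is healthy
oc get nodes                                # all masters Ready
oc get co                                   # operators Available, not Degraded
oc get pods -n openshift-etcd -l app=etcd   # etcd pods Running
ETCD_POD=$(oc get pods -n openshift-etcd -l app=etcd -o jsonpath='{.items[0].metadata.name}')
oc exec -n openshift-etcd $ETCD_POD -- etcdctl endpoint health -w table
```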
Next Steps
After creating etcd backups:
- Test Restore: Practice etcd restore in test environment
- Automate: Set up CronJob for regular automated backups
- Monitor: Configure alerts for backup failures
- Document: Create runbook for disaster recovery
- Integrate: Combine with OADP application backups
Related Skills:
- install-oadp - Install OADP operator
- create-backup - Create application backups
- diagnose-backup-issues - Troubleshoot backup problems
Resources
- OpenShift etcd Backup Documentation: https://docs.openshift.com/container-platform/latest/backup_and_restore/control_plane_backup_and_restore/backing-up-etcd.html
- Disaster Recovery: https://docs.openshift.com/container-platform/latest/backup_and_restore/control_plane_backup_and_restore/disaster-recovery.html
- etcd Documentation: https://etcd.io/docs/
Version: 1.0
Last Updated: 2025-11-17
Compatibility: OpenShift 4.10+

CRITICAL: etcd backups are essential for cluster disaster recovery