name	deployment-runbook
description	Deployment procedures, health checks, and rollback strategies. Use this skill when deploying applications, performing health checks, managing releases, or handling deployment failures. Provides systematic deployment workflows, verification scripts, and troubleshooting guides. Complements the devops-automation agent.

Deployment Runbook

Overview

This skill provides deployment procedures, automated health checks, and rollback strategies to ensure safe, reliable deployments. Use it to standardize deployment workflows and reduce deployment-related incidents.

When to Use This Skill

Planning production deployments
Executing staged rollouts (canary, blue-green)
Performing post-deployment health checks
Rolling back failed deployments
Troubleshooting deployment issues
Establishing deployment best practices
Complementing the devops-automation agent for deployments

Pre-Deployment Checklist

Before any production deployment:

Code Review: All changes reviewed and approved
Tests Pass: CI/CD pipeline green
- Unit tests: ✓
- Integration tests: ✓
- E2E tests: ✓
Database Migrations: Tested in staging
- Backward compatible
- Rollback script prepared
Configuration: Environment variables verified
- Secrets rotated if needed
- Feature flags configured
Monitoring: Dashboards and alerts ready
- Error tracking enabled
- Performance monitoring active
- Log aggregation configured
Communication: Stakeholders notified
- Deployment window announced
- On-call engineer assigned
- Rollback plan documented
Backups: Recent backup verified
- Database backed up < 1 hour ago
- Backup restoration tested
Capacity: Resources scaled appropriately
- Auto-scaling configured
- Rate limits reviewed
- CDN cache warmed
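
The "backed up < 1 hour ago" item in the checklist can be verified mechanically rather than by eye. The following is a minimal sketch, assuming the backup tool can report a timezone-aware completion timestamp; `backup_is_fresh` and `MAX_BACKUP_AGE` are illustrative names, not part of this skill's scripts.

```python
from datetime import datetime, timedelta, timezone

# From the checklist: database backed up less than 1 hour ago.
MAX_BACKUP_AGE = timedelta(hours=1)


def backup_is_fresh(backup_time, now=None, max_age=MAX_BACKUP_AGE):
    """Return True if the backup finished less than max_age before now."""
    now = now or datetime.now(timezone.utc)
    age = now - backup_time
    # A timestamp in the future suggests clock skew or bad metadata,
    # so treat it as not fresh instead of silently passing.
    return timedelta(0) <= age < max_age
```

A check like this only confirms the age of the backup; the "Backup restoration tested" item still needs a real restore.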

Deployment Strategies

1. Blue-Green Deployment

Best for: Zero-downtime deployments, easy rollbacks

Process:

Deploy to inactive (green) environment
Run health checks on green
Switch traffic from blue to green
Monitor for issues
Keep blue as instant rollback option

Commands:

# Deploy to green environment
./deploy.sh --env green

# Run health checks
python3 scripts/health_check.py --env green

# Switch traffic (gradual)
./switch_traffic.sh --from blue --to green --percentage 10
./switch_traffic.sh --from blue --to green --percentage 50
./switch_traffic.sh --from blue --to green --percentage 100

# If issues: instant rollback
./switch_traffic.sh --from green --to blue --percentage 100
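
The gradual switch above is safest when each step is gated on a health check. A minimal sketch of that control flow follows, assuming two hypothetical callables: `shift_traffic` (standing in for `switch_traffic.sh`) and `is_healthy` (standing in for `health_check.py`). Neither is provided by this skill in this form.

```python
def staged_switch(shift_traffic, is_healthy, steps=(10, 50, 100)):
    """Shift traffic to green in stages; fall back to blue on a failed check.

    shift_traffic(source, target, percentage) and is_healthy() are
    hypothetical hooks. Returns True if green ends up with all traffic.
    """
    for percentage in steps:
        shift_traffic("blue", "green", percentage)
        if not is_healthy():
            # Mirrors the instant-rollback command in the runbook.
            shift_traffic("green", "blue", 100)
            return False
    return True
```

In practice you would likely also wait between steps so that metrics have time to reflect the new traffic split.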

2. Canary Deployment

Best for: Risk-averse deployments, gradual rollouts

Process:

Deploy to small subset of servers (5-10%)
Monitor metrics closely
Gradually increase percentage
Roll back if metrics degrade

Monitoring During Canary:

Error rate < baseline + 1%
Response time < baseline + 10%
Success rate > 99.9%
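
The canary thresholds can be expressed as a small comparison against baseline metrics. This sketch reads the error-rate margin as absolute percentage points and the response-time margin as relative to baseline; that reading is an assumption, so adjust it to match how your team defines the limits. `Metrics` and `canary_violations` are illustrative names.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    error_rate: float        # percent of requests that failed, e.g. 0.2
    response_time_ms: float  # latency in milliseconds, e.g. p95
    success_rate: float      # percent of requests that succeeded


def canary_violations(baseline, canary):
    """Return the names of the canary thresholds that are violated."""
    violations = []
    # Assumed: error rate may exceed baseline by up to 1 percentage point.
    if canary.error_rate >= baseline.error_rate + 1.0:
        violations.append("error_rate")
    # Assumed: response time may be up to 10% slower than baseline.
    if canary.response_time_ms >= baseline.response_time_ms * 1.10:
        violations.append("response_time")
    if canary.success_rate <= 99.9:
        violations.append("success_rate")
    return violations
```

An empty list means the canary can proceed to the next percentage; any entry is a signal to roll back.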

3. Rolling Deployment

Best for: Standard updates, resource-constrained environments

Process:

Take one instance out of load balancer
Deploy new version
Run health checks
Add back to load balancer
Repeat for remaining instances
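
The rolling process above is a loop with an early exit. A minimal sketch, assuming four hypothetical hooks into your load balancer and deploy tooling (`remove_from_lb`, `deploy`, `is_healthy`, `add_to_lb`), none of which ship with this skill:

```python
def rolling_deploy(instances, remove_from_lb, deploy, is_healthy, add_to_lb):
    """Update instances one at a time; stop at the first failed health check.

    Returns (updated, failed), where failed is the instance that did not
    pass its health check, or None if every instance was updated.
    """
    updated = []
    for instance in instances:
        remove_from_lb(instance)
        deploy(instance)
        if not is_healthy(instance):
            # Keep the unhealthy instance out of rotation and stop, so the
            # remaining instances keep serving the previous version.
            return updated, instance
        add_to_lb(instance)
        updated.append(instance)
    return updated, None
```

Stopping at the first failure limits the blast radius to one instance, at the cost of temporarily reduced capacity.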

Deployment Workflow

Phase 1: Pre-Deployment (T-30 minutes)

# 1. Verify staging environment
./verify_staging.sh

# 2. Create deployment tag
git tag -a v1.2.3 -m "Release 1.2.3"
git push origin v1.2.3

# 3. Trigger production build
./build_production.sh --tag v1.2.3

# 4. Backup database
./backup_db.sh --environment production

# 5. Notify team
./notify_slack.sh "🚀 Starting deployment v1.2.3 in 30 minutes"

Phase 2: Deployment (T-0)

# 1. Enable maintenance mode (if needed)
./maintenance_mode.sh --enable

# 2. Run database migrations
./run_migrations.sh --environment production

# 3. Deploy application
./deploy.sh --environment production --version v1.2.3

# 4. Disable maintenance mode
./maintenance_mode.sh --disable

Phase 3: Post-Deployment Health Checks

# Run comprehensive health checks
python3 scripts/health_check.py --env production

# Expected output:
# ✓ API health endpoint responding
# ✓ Database connectivity OK
# ✓ Cache layer accessible
# ✓ External services reachable
# ✓ Error rate within threshold
# ✓ Response time within SLA

Phase 4: Monitoring (T+30 minutes)

Monitor these metrics:

Application Metrics:

Error rate: < 0.1%
Response time (p95): < 200ms
Request throughput: within expected range
Success rate: > 99.9%

Infrastructure Metrics:

CPU utilization: < 70%
Memory usage: < 80%
Disk I/O: normal patterns
Network latency: < 50ms

Business Metrics:

Conversion rate: no significant drop
User signups: within expected range
Transaction volume: normal patterns

Rollback Procedures

When to Roll Back

Roll back immediately if:

Error rate > 1%
Critical functionality broken
Data corruption detected
Security vulnerability introduced
Performance degradation > 50%
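
The rollback criteria above can be encoded so that the decision is not left to judgment during an incident. This is a sketch only: it assumes "performance degradation" means the relative increase in p95 response time against a pre-deployment baseline, and `rollback_reasons` is an illustrative name.

```python
def rollback_reasons(error_rate, baseline_p95_ms, current_p95_ms,
                     critical_broken=False, data_corruption=False,
                     security_issue=False):
    """Return the rollback triggers that apply; an empty list means none do."""
    reasons = []
    if error_rate > 1.0:  # error_rate is a percentage
        reasons.append("error rate > 1%")
    if critical_broken:
        reasons.append("critical functionality broken")
    if data_corruption:
        reasons.append("data corruption detected")
    if security_issue:
        reasons.append("security vulnerability introduced")
    # Assumption: degradation is the relative increase in p95 response time.
    if baseline_p95_ms > 0:
        slowdown = (current_p95_ms - baseline_p95_ms) / baseline_p95_ms
        if slowdown > 0.5:
            reasons.append("performance degradation > 50%")
    return reasons
```

Returning the reasons rather than a bare boolean gives you text to paste into the rollback announcement.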

Rollback Methods

Method 1: Traffic Switch (Fastest)

# Blue-green: instant rollback
./switch_traffic.sh --from green --to blue --percentage 100

# Verification
python3 scripts/health_check.py --env production

Method 2: Version Revert

# Deploy previous version
./deploy.sh --environment production --version v1.2.2

# Run health checks
python3 scripts/health_check.py --env production

Method 3: Database Rollback

# If migrations were applied
./rollback_migration.sh --environment production --steps 1

# Restore from backup (last resort: data written after the backup is lost)
./restore_db.sh --backup latest --environment production

Post-Rollback

Verify system health

python3 scripts/health_check.py --env production

Notify stakeholders

./notify_slack.sh "⚠️ Deployment v1.2.3 rolled back. System stable on v1.2.2"

Create postmortem
- What went wrong?
- Why didn't we catch it?
- How do we prevent recurrence?

Health Check Script

Use the included health check script:

# Run all checks
python3 scripts/health_check.py --env production

# Run specific check
python3 scripts/health_check.py --env production --check api

# Verbose output
python3 scripts/health_check.py --env production --verbose

See scripts/health_check.py for implementation.
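
For orientation, here is one way a script with this command-line interface could be structured. It is a hypothetical sketch, not the contents of scripts/health_check.py: the `CHECKS` registry and its always-passing placeholder checks are assumptions made for illustration.

```python
import argparse

# Hypothetical registry; each check takes the environment name and returns
# True when healthy. Real checks would call the API, database, cache, etc.
CHECKS = {
    "api": lambda env: True,
    "database": lambda env: True,
    "cache": lambda env: True,
}


def run(argv):
    """Parse arguments, run the selected checks, and return an exit code."""
    parser = argparse.ArgumentParser(description="Deployment health checks (sketch)")
    parser.add_argument("--env", required=True, help="target environment")
    parser.add_argument("--check", choices=sorted(CHECKS), help="run one check only")
    parser.add_argument("--verbose", action="store_true", help="also print passing checks")
    args = parser.parse_args(argv)

    names = [args.check] if args.check else sorted(CHECKS)
    failed = []
    for name in names:
        ok = CHECKS[name](args.env)
        if not ok:
            failed.append(name)
        if args.verbose or not ok:
            print(f"{'OK' if ok else 'FAIL'} {name}")
    # A non-zero exit code lets a deploy pipeline stop on failure.
    return 1 if failed else 0
```

A real entry point would typically end with `sys.exit(run(sys.argv[1:]))` so that shell scripts can branch on the result.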

Troubleshooting Guide

Issue: Deployment Hangs

Symptoms:

Deployment script doesn't complete
Services not starting

Diagnosis:

# Check service logs
kubectl logs -f deployment/app-name

# Check events
kubectl get events --sort-by='.lastTimestamp'

Resolution:

Increase timeout values
Check resource constraints
Verify image pull secrets
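
When a timeout is raised, it helps to keep every wait bounded so that a hung deployment fails with a clear result instead of blocking forever. A minimal polling sketch, where `wait_until` and its `predicate` argument (for example, "all services report ready") are illustrative assumptions:

```python
import time


def wait_until(predicate, timeout, interval=1.0,
               clock=time.monotonic, sleep=time.sleep):
    """Poll predicate until it returns True or timeout seconds elapse.

    Returns True on success and False on timeout. clock and sleep are
    injectable so the function can be tested without real delays.
    """
    deadline = clock() + timeout
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

Using a monotonic clock avoids false timeouts when the system clock is adjusted mid-deployment.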

Issue: High Error Rate Post-Deployment

Symptoms:

Error rate spike
500 errors in logs

Diagnosis:

# Check application logs
tail -f /var/log/app/error.log

# Check error distribution
grep "ERROR" /var/log/app/* | awk '{print $NF}' | sort | uniq -c | sort -nr
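
The error-distribution pipeline above groups ERROR lines by their last field. The same grouping can be sketched in Python for use inside a script; `error_distribution` is an illustrative name, and the assumption that the last whitespace-separated token identifies the error type carries over from the shell version.

```python
from collections import Counter


def error_distribution(lines):
    """Count ERROR lines by their last whitespace-separated token.

    Returns (token, count) pairs, most frequent first.
    """
    counts = Counter(
        line.split()[-1]
        for line in lines
        if "ERROR" in line
    )
    return counts.most_common()
```

If your log format puts the error type elsewhere in the line, change the token that is selected.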

Resolution:

Check configuration changes
Verify environment variables
Review recent code changes
Consider immediate rollback

Issue: Database Connection Failures

Symptoms:

"Connection refused" errors
Timeout errors

Diagnosis:

# Test database connectivity
python3 scripts/test_db_connection.py

# Check connection pool
psql -h db-host -U user -c "SELECT * FROM pg_stat_activity;"

Resolution:

Verify connection strings
Check firewall rules
Increase connection pool size
Verify credentials
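
Before digging into credentials or pool sizes, it can help to separate network problems from database problems with a plain TCP probe. The sketch below is not the included test_db_connection.py; `can_connect` is an illustrative helper, and the host and port you pass are whatever your connection string specifies.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and DNS resolution failures.
        return False
```

A successful probe means the firewall and listener are fine, which points the investigation toward credentials or pool limits instead.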

Communication Templates

Pre-Deployment Announcement

🚀 **Production Deployment Scheduled**

**Version**: v1.2.3
**Time**: 2024-01-15 14:00 UTC (in 30 minutes)
**Duration**: ~15 minutes
**Impact**: No expected downtime

**Changes**:
- Feature: New user dashboard
- Fix: Payment processing bug
- Performance: API response time improvements

**Rollback Plan**: Blue-green switch (instant)
**On-Call**: @engineer-name

Deployment Success

✅ **Deployment Complete**

**Version**: v1.2.3
**Status**: Successful
**Duration**: 12 minutes

**Health Checks**: All passing ✓
**Metrics**: Within normal range
**Next Check**: T+30 minutes

Monitoring dashboard: [link]

Deployment Rollback

⚠️ **Deployment Rolled Back**

**Version**: v1.2.3 → v1.2.2 (rollback)
**Reason**: Elevated error rate (2.1%)
**Status**: System stable on v1.2.2

**Action Items**:
- [ ] Root cause analysis
- [ ] Fix identified issue
- [ ] Re-test in staging
- [ ] Schedule re-deployment

Incident report: [link]

Resources

scripts/

health_check.py: Comprehensive deployment health checks
test_db_connection.py: Database connectivity verification

references/

deployment-checklist.md: Detailed pre/post deployment checklist
monitoring-guide.md: Metrics to monitor during deployments

Best Practices

Always deploy during low-traffic windows
Never deploy on Fridays (unless critical hotfix)
Keep deployments small (< 200 lines changed)
Monitor for 30+ minutes post-deployment
Document every rollback with postmortem
Test rollback procedure in staging first
Use feature flags for risky changes
Automate health checks (don't rely on manual verification)

Quick Reference

Emergency Rollback:

./switch_traffic.sh --from green --to blue --percentage 100

Health Check:

python3 scripts/health_check.py --env production

View Logs:

kubectl logs -f deployment/app-name --tail=100

Check Metrics:

curl https://metrics.example.com/api/health
