| name | deployment-patterns |
| description | Summarises deployment and rollback guidance for production releases. Use when planning a rollout, staging strategy, or recovery playbook; consult the detailed playbook for full procedures. |
| allowed-tools | Read, Grep, Glob |
Deployment Patterns - Stage 1: Metadata
Skill Overview
- Name: Deployment Patterns
- Purpose: Create safe, reversible production deployments with fast recovery
- When to Use: Planning deployments, rollback strategies, risk-based rollouts
- Core Rule: Every deployment must be recoverable in < 5 minutes
- Sections Available: 3-Level Rollback, Staged Deployment, Monitoring Setup
Core Philosophy
Deployment = Risk Management: Minimize blast radius, maximize recovery speed
Key Principles:
- Feature flags for instant rollback (< 5 min)
- Staged rollouts catch issues early (10% → 50% → 100%)
- Monitor before proceeding (no alerts = proceed)
- Clear rollback triggers (no debate during incidents)
Stage 2: Quick Reference
Rollback Decision Framework
The 3-Level Escalation Model
Start with fastest rollback, escalate only if needed.
Level 1: Feature Flag (< 5 min)
```bash
# Toggle feature off (set it in the app's environment, e.g. your .env)
export FEATURE_NAME_ENABLED=false
pm2 reload app --update-env

# Verify
curl https://api.example.com/health | jq '.features.feature_name'
# Expected: false
```
Use when: First response to ANY issue. Requires feature flag infrastructure.
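The verification step above assumes the app's health endpoint reports feature-flag state. A minimal sketch of such an endpoint (Express and the response shape are assumptions; match it to whatever `/health` already returns):

```javascript
// Hypothetical Express health endpoint exposing feature-flag state, so
// `curl .../health | jq '.features.feature_name'` can confirm the rollback.
const express = require('express');
const app = express();

app.get('/health', (req, res) => {
  res.json({
    status: 'ok',
    features: {
      feature_name: process.env.FEATURE_NAME_ENABLED === 'true',
    },
  });
});

app.listen(3000);
```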
Level 2: Configuration Rollback (< 10 min)
```bash
# Revert config files
git checkout HEAD~1 -- config/database.js config/auth.js
pm2 restart app

# Monitor
curl https://api.example.com/health
```
Use when: Feature flag insufficient, config changes caused issue.
Level 3: Code Rollback (< 15 min)
```bash
# Revert the commit
git revert abc1234 --no-edit

# MANDATORY: Run tests
npm test

# Deploy
npm run deploy:production

# Monitor for 1 hour minimum
```
Use when: Configuration rollback insufficient, code changes broke production.
Rollback Triggers
Immediate Rollback (No Discussion)
- Security incident discovered
- Complete service outage (all users)
- Data loss or corruption
- Error rate >5% sustained
- Critical functionality completely broken
Evaluate Within 15 Minutes
- Error rate >0.5% sustained for 10 min
- API latency p95 >500ms sustained
- Success rate <99.5% for critical endpoints
- 10+ user reports of same issue
Monitor But Don't Rollback
- Error rate 0.2-0.5% (warning level)
- Latency spike (investigate first)
- Single user report (may be user error)
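To keep these triggers unambiguous during an incident, they can be encoded directly. The sketch below mirrors the three lists above; the `metrics` shape and function name are illustrative, and sustained-duration checks are left to the monitoring layer:

```javascript
// Illustrative rollback-trigger classifier; thresholds mirror the lists above.
function classifyRollback(metrics) {
  const {
    securityIncident, completeOutage, dataLossOrCorruption, criticalFunctionalityBroken,
    errorRate,              // fraction, e.g. 0.006 = 0.6%
    latencyP95Ms,
    successRate,            // fraction, e.g. 0.994
    duplicateUserReports,
  } = metrics;

  if (securityIncident || completeOutage || dataLossOrCorruption ||
      criticalFunctionalityBroken || errorRate > 0.05) {
    return 'IMMEDIATE_ROLLBACK';        // no discussion: Level 1 now
  }
  if (errorRate > 0.005 || latencyP95Ms > 500 ||
      successRate < 0.995 || duplicateUserReports >= 10) {
    return 'EVALUATE_WITHIN_15_MIN';
  }
  if (errorRate >= 0.002 || duplicateUserReports >= 1) {
    return 'MONITOR';                   // warning level: investigate, don't roll back yet
  }
  return 'HEALTHY';
}
```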
Deployment Strategy Decision Matrix
Choose deployment approach based on risk:
ZERO-RISK: Deploy Immediately
- Test files, documentation, code with flags OFF
- Wait: 0 hours
- Monitor: Deployment success only
LOW-RISK: Flag-Gated
- New files not yet integrated
- Utility functions not yet called
- Wait: 1 hour with flag OFF
- Monitor: No errors from new code (should be silent)
MEDIUM-RISK: Canary
- Middleware changes, API modifications
- Database migrations (backward compatible)
- Stages: 10% (2-4 hrs) → 50% (4-8 hrs) → 100%
- Monitor: Error rate, latency, success rate
HIGH-RISK: Extended Canary
- Authentication, payment processing
- Data writes, breaking DB migrations
- Stages: 1-5% (8-24 hrs) → 10% → 25% → 50% → 75% → 100%
- Each stage: 4-8 hours minimum
- Intensive monitoring at every stage
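The matrix reduces naturally to data, which keeps the choice of rollout plan mechanical rather than a per-deployment debate. A sketch (the names and the soak values chosen from within each stated range are illustrative):

```javascript
// Risk tier -> rollout plan, mirroring the decision matrix above.
// Soak times are picked from the upper end of each stated range.
const ROLLOUT_PLANS = {
  'zero-risk':   { stages: [100],                    soakHours: [0] },
  'low-risk':    { stages: [0, 100],                 soakHours: [1, 0] },   // 1 hr with flag OFF
  'medium-risk': { stages: [10, 50, 100],            soakHours: [4, 8, 0] },
  'high-risk':   { stages: [5, 10, 25, 50, 75, 100], soakHours: [24, 8, 8, 8, 8, 0] },
};

function rolloutPlan(riskTier) {
  const plan = ROLLOUT_PLANS[riskTier];
  if (!plan) throw new Error(`Unknown risk tier: ${riskTier}`);
  return plan;
}
```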
The 5-Stage Deployment Sequence
Stage 1: Infrastructure (ZERO-RISK)
Deploy tests, docs, code with flags OFF
- Wait: 0 hours
- Success: Build succeeds, app healthy
Stage 2: Integration (LOW-RISK)
All code deployed, flag still OFF
- Wait: 1 hour
- Success: No memory leaks, no new errors
Stage 3: Canary (MEDIUM-RISK)
Enable for 10% of users
- Wait: 2-4 hours
- Success: Error rate stable, latency within 10% baseline
- Rollback: Level 1 if ANY critical alert
Stage 4: Partial (MEDIUM-RISK)
Enable for 50% of users
- Wait: 4-8 hours (include overnight if possible)
- Success: Metrics stable at scale
Stage 5: Full Rollout (HIGH-RISK)
Enable for 100% of users
- Wait: 24 hours continuous monitoring
- Success: Metrics scale linearly, no surprises
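A simple way to enforce the wait-and-verify step between stages is a promotion gate that only advances when the soak period passes cleanly. A sketch, assuming `getActiveAlerts` is a wrapper around your monitoring API:

```javascript
// Illustrative stage-promotion gate: soak, then advance only if no
// warning- or critical-level alerts are active.
async function promoteWhenStable(stageName, soakMs, getActiveAlerts) {
  await new Promise((resolve) => setTimeout(resolve, soakMs));

  const alerts = await getActiveAlerts();
  if (alerts.some((a) => a.severity === 'critical')) {
    throw new Error(`${stageName}: critical alert active, start Level 1 rollback`);
  }
  if (alerts.some((a) => a.severity === 'warning')) {
    return { promote: false, reason: `${stageName}: warnings active, keep soaking` };
  }
  return { promote: true };
}
```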
Monitoring Metrics
Required Dashboard Metrics
| Metric | Baseline | Warning | Critical | Action |
|---|---|---|---|---|
| Error Rate | <0.05% | >0.2% | >0.5% | Level 1 rollback |
| Latency p95 | 150ms | >300ms | >500ms | Investigate, Level 1 if sustained |
| Success Rate | 99.95% | <99.8% | <99.5% | Level 1 rollback |
| Database Errors | <5/min | >10/min | >50/min | Level 1, investigate DB |
| Security Alert | 0 | ANY | ANY | Immediate Level 1 + incident |
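These numbers double as alerting configuration (see the pre-deployment checklist item on alert thresholds). A direct transcription of the table; the metric keys are illustrative:

```javascript
// Alert thresholds transcribed from the table above.
// Rates are fractions, latency is in milliseconds, DB errors are per minute.
const ALERT_THRESHOLDS = {
  errorRate:      { baseline: 0.0005, warning: 0.002,   critical: 0.005,   action: 'Level 1 rollback' },
  latencyP95Ms:   { baseline: 150,    warning: 300,     critical: 500,     action: 'Investigate; Level 1 if sustained' },
  successRate:    { baseline: 0.9995, warnBelow: 0.998, critBelow: 0.995,  action: 'Level 1 rollback' },
  dbErrorsPerMin: { baseline: 5,      warning: 10,      critical: 50,      action: 'Level 1; investigate DB' },
  securityAlerts: { baseline: 0,      warning: 1,       critical: 1,       action: 'Immediate Level 1 + incident' },
};
```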
Custom Metrics to Track
- Feature-specific metrics (e.g., JWT tokens issued)
- Database query performance
- Cache hit rates
- External API call rates
Deployment Timing
Best Times
- Tuesday-Thursday, 10am-2pm (optimal)
- Monday, 11am-2pm (acceptable)
Never Deploy
- Fridays after 12pm - Weekend incident risk
- Before holidays - Team unavailable
- After 4pm - Team leaving soon
- During peak traffic - Increased blast radius
Exception: Security hotfixes deploy ASAP (any time)
Pre-Deployment Checklist
Critical Checks:
- [ ] All tests passing (unit, integration, E2E)
- [ ] Code review complete
- [ ] Rollback strategy documented (which level to use)
- [ ] Monitoring dashboards configured
- [ ] Feature flags configured (if needed)
- [ ] On-call engineer assigned
- [ ] Deployment window confirmed (avoid Fridays!)
- [ ] Alert thresholds configured
- [ ] Database migrations backward compatible
- [ ] Stakeholders notified of schedule
Decision Framework: Blue/Green vs Canary vs Phased
Blue/Green (Infrastructure-Level):
- Use when: Need instant atomic switchover
- Best for: Infrastructure changes, complete version swaps
- Benefit: Instant rollback (flip load balancer)
- Cost: 2x infrastructure during deployment
Canary (Traffic-Level):
- Use when: Testing new code with subset of users
- Best for: Application changes, new features
- Benefit: Gradual risk exposure, real user testing
- Pattern: 10% → 50% → 100%
Phased (Feature-Level):
- Use when: Gradual feature adoption desired
- Best for: Major UX changes, new workflows
- Benefit: User feedback before full rollout
- Pattern: Internal → Beta users → All users
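For the canary pattern, bucketing should be deterministic so a given user stays in (or out of) the canary as the percentage grows, rather than flipping between code paths on every request. A sketch (hashing on user ID is an assumption; any stable attribute works):

```javascript
// Deterministic canary bucketing: the same user always lands in the same
// bucket, so raising the rollout (10 -> 50 -> 100) only adds users.
const crypto = require('crypto');

function inCanary(userId, rolloutPercent) {
  const hash = crypto.createHash('sha256').update(String(userId)).digest();
  const bucket = hash.readUInt32BE(0) % 100; // 0-99
  return bucket < rolloutPercent;
}

// Usage:
// if (inCanary(user.id, 10)) return newHandler(req);
// return legacyHandler(req);
```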
Rollback-Ready Design Patterns
1. Feature Flags
```javascript
// Every new feature behind a flag
const isEnabled = process.env.NEW_AUTH_ENABLED === 'true';

if (isEnabled) {
  return newAuthFlow(user);
}
return legacyAuthFlow(user);
```
2. Backward-Compatible Migrations
```javascript
// ADD columns, don't REMOVE
// NEW tables, don't MODIFY existing
// Can roll back code without a DB rollback
await db.schema.alterTable('users', (table) => {
  table.string('new_field').nullable(); // Nullable for compatibility
});
```
3. Configuration-Driven Behavior
```javascript
// Allow runtime changes without a code deploy
const config = {
  maxRetries: parseInt(process.env.MAX_RETRIES, 10) || 3,   // env vars are strings, so parse them
  timeout: parseInt(process.env.API_TIMEOUT, 10) || 5000
};
```
Post-Rollback Procedure
After ANY rollback:
- Notify team - #incidents channel
- Update status page - If customer impact
- Document incident - What, when, why rolled back
- Keep monitoring - 24 hours minimum
- Schedule postmortem - Within 48 hours
- Fix root cause - Before re-deployment attempt
For Detailed Procedures
This skill provides quick-reference patterns and decision frameworks. For detailed runbooks, operational roles, and complete procedures, see the Deployment Playbook, which includes:
- Detailed rollback procedures with commands
- Complete 5-stage deployment sequence
- Monitoring dashboard setup
- Postmortem templates
- Change ticket templates
- Environment readiness checks
When to consult the playbook:
- Executing actual rollback (need exact commands)
- Setting up new deployment pipeline
- Creating monitoring dashboards
- Writing postmortem after incident
- Training team on deployment procedures
Integration with Other Skills
- risk-analysis: Identify deployment risks before planning
- verification-before-completion: Evidence-based verification at each stage
- integration-patterns: External API deployment coordination
- log-analysis-patterns: Monitoring and alerting setup