---
name: docker-deployment-guardian
description: Offensive Docker deployment optimization and pre-flight validation. Triggered when reviewing Dockerfiles, docker-compose.yml, container configurations, or preparing for production deployments. Scans for architecture issues, missing static assets, native module problems, environment misconfigurations, and resource optimization opportunities. Produces auto-scan reports with actionable improvements.
---
# Docker Deployment Guardian
Mission: Prevent Docker deployment failures through proactive architecture scanning and optimization. This skill operates in offensive mode: it seeks opportunities to improve, not just errors to catch.
## Activation Triggers
- User mentions Docker setup review
- Code review for deployments
- Pre-production validation
- Performance optimization requests
- Dockerfile or docker-compose.yml analysis
## Scan Methodology
### 1. Initial Context Gathering
Ask if not provided:
- "Show me your Dockerfile and docker-compose.yml"
- "What's your application stack?" (Node.js/Python/etc.)
- "Are you using native modules?" (bcrypt, sharp, anything built with node-gyp)
- "What static assets need to be served?" (views, templates, public files)
### 2. Critical Architecture Scan
Execute ALL checks in this section; each is based on a real production incident.
#### 🔴 CRITICAL: Static Asset Copying
Historical failure: a missing views folder caused a 100% production outage.
Scan for:
- Explicit COPY commands for ALL static directories (views, templates, public, uploads, etc.)
- Build output verification - does TypeScript/Webpack output get copied?
- Template engine requirements (EJS, Pug, Handlebars) vs copied files
Red flags:
- Only copying `package.json` and `src/` without templates
- Relying on implicit copying from build tools
- No verification step confirming the files exist in the container
Optimization:

```dockerfile
# ❌ RISKY - May miss templates
COPY src/ /app/src/

# ✅ EXPLICIT - Guarantees presence
COPY src/ /app/src/
COPY views/ /app/views/
COPY public/ /app/public/
RUN ls -la /app/views /app/public  # Verification
```
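The same verification can also run before the build ever starts. A minimal sketch of a pre-build guard; the directory names match the COPY example above, and the function name is illustrative:

```shell
#!/bin/sh
# Sketch of a pre-build guard: abort early if a directory the Dockerfile is
# supposed to COPY is missing from the build context.
check_static_dirs() {
  for dir in "$@"; do
    if [ ! -d "$dir" ]; then
      echo "MISSING: $dir" >&2
      return 1
    fi
  done
  echo "all static dirs present"
}

# Typical usage before the build (image tag is a placeholder):
#   check_static_dirs views public && docker build -t myapp .
```

Running this in CI turns "forgot to copy views" from a production outage into a failed build.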
#### 🔴 CRITICAL: Native Module Build Process
Historical failure: bcrypt native bindings failed inside the container.
Scan for:
- Native modules in package.json (bcrypt, sharp, sqlite3, or anything compiled via node-gyp)
- Post-install rebuild commands
- Build tools in base image (python, make, g++)
- Architecture matching (ARM vs x64)
Red flags:
- Native modules without rebuild step
- Missing build dependencies in Dockerfile
- No platform-specific build instructions
Optimization:

```dockerfile
# ✅ COMPLETE - Native module handling
FROM node:18-alpine
RUN apk add --no-cache python3 make g++
COPY package*.json ./
RUN npm ci
RUN npm rebuild bcrypt --build-from-source
```
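Architecture mismatches (a bcrypt binary compiled on an Apple Silicon laptop will not load in an x64 container) can be caught by making the build platform explicit. A sketch; the uname-to-Docker-platform mapping covers only the common cases and is an assumption:

```shell
#!/bin/sh
# Map the host architecture to a Docker platform string so the build target
# is explicit rather than implicit.
arch=$(uname -m)
case "$arch" in
  x86_64)        platform="linux/amd64" ;;
  aarch64|arm64) platform="linux/arm64" ;;
  *)             platform="unknown" ;;
esac
echo "host platform: $platform"

# Then build for the platform the image will actually deploy to, e.g.:
#   docker buildx build --platform linux/amd64 -t myapp .
```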
#### 🟡 HIGH: Environment Variable Validation
Historical failure: MySQL configuration mismatch between dev and prod.
Scan for:
- All required ENV vars documented
- Default values provided in Dockerfile or docker-compose
- Sensitive data handled via secrets/volumes
- Port consistency across services
Red flags:
- Hard-coded database hosts
- Missing ENV validation logic
- Port conflicts (multiple services on 3306, etc.)
- No .env.example file
Optimization:

```yaml
# ✅ CLEAR - Explicit env management
services:
  api:
    environment:
      - DATABASE_HOST=${DB_HOST:-mysql}  # Default fallback
      - DATABASE_PORT=${DB_PORT:-3306}
      - NODE_ENV=${NODE_ENV:-production}
    env_file:
      - .env
```
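Compose defaults help, but a fail-fast check at container start catches variables that have no sensible default. A minimal sketch for an entrypoint script; the function and variable names are illustrative:

```shell
#!/bin/sh
# Fail fast at startup when required variables are unset, instead of letting
# the app crash later with a cryptic connection error.
require_env() {
  missing=0
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "FATAL: required env var $name is not set" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Typical entrypoint usage:
#   require_env DATABASE_HOST DATABASE_PORT || exit 1
#   exec node server.js
```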
#### 🟡 HIGH: Container Health & Lifecycle
Historical failure: a worker container stuck in a restart loop, consuming resources.
Scan for:
- Health checks defined for all services
- Restart policies appropriate for service type
- Startup/readiness probes
- Graceful shutdown handling
Red flags:
- `restart: always` on job/worker containers
- No health checks on web services
- Missing SIGTERM handlers
- Unused containers still defined
Optimization:

```yaml
# ✅ ROBUST - Proper lifecycle management
services:
  api:
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
```
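For the missing-SIGTERM-handler red flag, a shell entrypoint can forward the signal to the app process so `docker stop` shuts down gracefully instead of timing out into SIGKILL. A sketch; the `sleep` and the self-sent signal only stand in for the real server and a real `docker stop`:

```shell
#!/bin/sh
# Forward SIGTERM to the child process and wait for it to exit.
child=""
term_handler() {
  echo "SIGTERM received, shutting down"
  [ -n "$child" ] && kill -TERM "$child" 2>/dev/null || true
}
trap term_handler TERM

sleep 30 &          # placeholder for the real server, e.g. `node server.js &`
child=$!

kill -TERM $$ &     # simulate `docker stop` for demonstration only
wait "$child" 2>/dev/null || true
echo "exited cleanly"
```

Without the `trap`, PID 1 in the container would ignore SIGTERM and Docker would wait out its stop timeout before killing the process.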
#### 🟠 MEDIUM: Resource Allocation
Historical finding: 250 MB of RAM saved by removing an unused worker.
Scan for:
- Memory limits defined
- CPU constraints appropriate
- Unused services removed
- Resource monitoring planned
Red flags:
- No resource limits (a single runaway container can exhaust the host)
- Identical limits for all services (not tuned)
- Development containers in production compose
Optimization:

```yaml
# ✅ TUNED - Based on actual usage
services:
  api:
    deploy:
      resources:
        limits:
          memory: 512M
          cpus: '0.5'
        reservations:
          memory: 256M
```
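Limits should come from observation, not guesswork. A hypothetical sizing helper encoding one common rule of thumb (limit ≈ 2x steady-state usage); the rule and the placeholder value are assumptions, not measurements:

```shell
#!/bin/sh
# Observe real usage first with a read-only command, e.g.:
#   docker stats --no-stream --format "{{.Name}}: {{.MemUsage}} {{.CPUPerc}}"
observed_mb=256   # placeholder: steady-state memory seen in docker stats
limit_mb=$((observed_mb * 2))
echo "suggested: limit ${limit_mb}M, reservation ${observed_mb}M"
```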
#### 🟢 LOW: Build Optimization
Not failure-critical, but improves performance.
Scan for:
- Multi-stage builds
- Layer caching strategy
- .dockerignore configured
- Minimal base images
Optimization opportunities:
- Switch to Alpine-based images where possible (often ~70% smaller)
- Copy package.json before the source tree (better layer caching)
- Remove dev dependencies in the production stage
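A missing `.dockerignore` can be bootstrapped with a few common defaults; the entries below are typical for a Node.js stack, not project-specific, so review them against your layout:

```shell
#!/bin/sh
# Work in a scratch directory for the demo.
cd "$(mktemp -d)"

# Generate a starter .dockerignore: keeps node_modules, VCS history, and
# local secrets out of the build context (smaller, faster, safer builds).
cat > .dockerignore <<'EOF'
node_modules
.git
.env
*.log
coverage
EOF
echo "wrote $(wc -l < .dockerignore | tr -d ' ') entries"
```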
### 3. Architecture Review
Cross-cutting concerns:
- Service isolation: Should worker be separate container?
- Network segmentation: Backend services exposed unnecessarily?
- Secrets management: Using Docker secrets vs ENV vars?
- Logging strategy: stdout/stderr vs volume mounts?
- Backup procedures: Volumes documented and backed up?
### 4. Production Readiness Checklist
Generate this checklist in the auto-scan report:
```
PRODUCTION READINESS SCORE: X/10

✅ Static assets explicitly copied
✅ Native modules rebuilt in container
✅ Environment variables validated
✅ Health checks configured
✅ Resource limits defined
⚠️ No monitoring solution specified
⚠️ Backup procedures not documented
❌ Missing: .dockerignore file
❌ Critical: No graceful shutdown handling
❌ Critical: Port conflicts detected

RISK LEVEL: [LOW/MEDIUM/HIGH/CRITICAL]
BLOCKERS: X critical issues must be resolved
RECOMMENDATIONS: Y optimizations suggested
```
## Output Format: Auto-Scan Report
```
═══════════════════════════════════════════════
🛡️ DOCKER DEPLOYMENT GUARDIAN - SCAN RESULTS
═══════════════════════════════════════════════

📊 SCAN SCOPE
• Dockerfile: [filename]
• Compose: [filename]
• Stack: [technology]
• Services: [count]

🚨 CRITICAL FINDINGS: [count]
[List each critical issue with:
 - What's wrong
 - Why it's dangerous (cite historical incident)
 - How to fix (code example)]

⚠️ HIGH PRIORITY: [count]
[Same format as critical]

💡 OPTIMIZATIONS: [count]
[Improvements that enhance but don't block]

═══════════════════════════════════════════════
FINAL VERDICT
═══════════════════════════════════════════════
Production Ready: [YES/NO/BLOCKED]
Risk Level: [LOW/MEDIUM/HIGH/CRITICAL]
Estimated Fix Time: [X hours]

NEXT ACTIONS:
1. [Most critical fix]
2. [Second priority]
3. [Optional optimization]
═══════════════════════════════════════════════
```
## Reference Materials
For detailed error patterns and historical incidents, see:
- `references/error-patterns.md` - Complete incident database with resolutions
## Advanced Scanning
When to escalate:
- User says "comprehensive audit"
- Production incident occurred
- Multiple services (>5 containers)
- Complex networking/volumes
Escalation actions:
- Read the entire docker-compose file, including networks and volumes
- Scan for security issues (exposed ports, root user, etc.)
- Analyze resource usage patterns
- Review CI/CD integration points
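Some of these escalation checks can be approximated with plain text scans over the Dockerfile; a rough sketch, where the grep patterns are heuristics and not a substitute for a real security scanner:

```shell
#!/bin/sh
# Heuristic security sniff: flag a Dockerfile that never drops root, and
# any file that binds ports to all interfaces.
scan_dockerfile() {
  file="$1"
  found=0
  if ! grep -q '^USER ' "$file"; then
    echo "WARN: $file has no USER instruction (runs as root)"
    found=1
  fi
  if grep -q '0\.0\.0\.0:' "$file"; then
    echo "WARN: $file binds to all interfaces"
    found=1
  fi
  return "$found"
}

# Usage: scan_dockerfile ./Dockerfile
```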
## Key Principles
- Offensive mindset: Don't just find errors; find improvements
- Evidence-based: Every check maps to a real historical incident
- Actionable: Every finding includes a code-example fix
- Honest: Report actual risk, not the theoretical worst case
- Fast: Complete the scan in under 2 minutes of LLM time