| name | DevOps & Deployment |
| description | CI/CD pipelines, containerization, Kubernetes, and infrastructure as code patterns |
| version | 1.0.0 |
| category | Infrastructure & Deployment |
| agents | backend-system-architect, code-quality-reviewer, studio-coach |
| keywords | CI/CD, deployment, Docker, Kubernetes, pipeline, infrastructure, GitOps, container, automation, release |
DevOps & Deployment Skill
Comprehensive frameworks for CI/CD pipelines, containerization, deployment strategies, and infrastructure automation.
When to Use
- Setting up CI/CD pipelines
- Containerizing applications
- Deploying to Kubernetes or cloud platforms
- Implementing GitOps workflows
- Managing infrastructure as code
- Planning release strategies
Pipeline Architecture
┌─────────────┐   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐
│    Code     │──▶│    Build    │──▶│    Test     │──▶│   Deploy    │
│   Commit    │   │   & Lint    │   │   & Scan    │   │  & Release  │
└─────────────┘   └─────────────┘   └─────────────┘   └─────────────┘
       │                 │                 │                 │
       ▼                 ▼                 ▼                 ▼
   Triggers          Artifacts          Reports          Monitoring
Key Concepts
CI/CD Pipeline Stages
- Lint & Type Check - Code quality gates
- Unit Tests - Test coverage with reporting
- Security Scan - npm audit + Trivy vulnerability scanner
- Build & Push - Docker image to container registry
- Deploy Staging - Environment-gated deployment
- Deploy Production - Manual approval or automated
See templates/github-actions-pipeline.yml for the complete GitHub Actions workflow.
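A minimal sketch of how those six stages can be wired together as gated jobs (job names, commands, and the registry path are placeholders; environment protection rules supply the manual approval, and the template above is the complete version):

name: CI/CD
on:
  push:
    branches: [main]
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: npm ci && npm run lint
  test:
    needs: lint
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: npm ci && npm test -- --coverage
  security:
    needs: lint
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: npm audit --audit-level=high
  build-and-push:
    needs: [test, security]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: docker build -t ghcr.io/example/app:${{ github.sha }} .   # push step omitted
  deploy-staging:
    needs: build-and-push
    runs-on: ubuntu-latest
    environment: staging            # environment-gated deployment
    steps:
      - run: echo "deploy to staging"   # placeholder deploy command
  deploy-production:
    needs: deploy-staging
    runs-on: ubuntu-latest
    environment: production         # manual approval via protection rules
    steps:
      - run: echo "deploy to production"   # placeholder deploy command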
Container Best Practices
Multi-stage builds minimize image size:
- Stage 1: Install production dependencies only
- Stage 2: Build application with dev dependencies
- Stage 3: Production runtime with minimal footprint
Security hardening:
- Non-root user (uid 1001)
- Read-only filesystem where possible
- Health checks for orchestrator integration
See
templates/Dockerfileandtemplates/docker-compose.yml
Kubernetes Deployment
Essential manifests:
- Deployment with rolling update strategy
- Service for internal routing
- Ingress for external access with TLS
- HorizontalPodAutoscaler for scaling
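As an example of the last item, a minimal HorizontalPodAutoscaler sketch (the deployment name and thresholds are illustrative):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70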
Security context:
- runAsNonRoot: true
- allowPrivilegeEscalation: false
- readOnlyRootFilesystem: true
- Drop all capabilities
Resource management:
- Always set requests and limits
- Use requests for scheduling, limits for throttling
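A container-spec fragment combining the hardening and resource settings above (image name and values are illustrative):

containers:
  - name: app
    image: registry.example.com/app:1.0.0
    securityContext:
      runAsNonRoot: true
      runAsUser: 1001
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop: ["ALL"]
    resources:
      requests:            # used by the scheduler for placement
        cpu: 250m
        memory: 256Mi
      limits:              # enforced at runtime (CPU throttling / OOM kill)
        cpu: 500m
        memory: 512Mi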
See templates/k8s-manifests.yaml and templates/helm-values.yaml.
Deployment Strategies
| Strategy | Use Case | Risk |
|---|---|---|
| Rolling | Default, gradual replacement | Low - automatic rollback |
| Blue-Green | Instant switch, easy rollback | Medium - double resources |
| Canary | Progressive traffic shift | Low - gradual exposure |
Rolling Update (Kubernetes default):
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 25%
    maxUnavailable: 0   # Zero downtime
Blue-Green: Deploy to a standby environment, then switch the Service selector to cut traffic over.
Canary: Use an Istio VirtualService for progressive traffic shifting (10% → 50% → 100%), as sketched below.
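A minimal canary VirtualService sketch, assuming Istio is installed and stable/canary subsets are defined in a DestinationRule (host and subset names are illustrative):

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: myapp
spec:
  hosts:
    - myapp
  http:
    - route:
        - destination:
            host: myapp
            subset: stable
          weight: 90
        - destination:
            host: myapp
            subset: canary
          weight: 10    # raise to 50, then 100 as confidence grows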
Infrastructure as Code
Terraform patterns:
- Remote state in S3 with DynamoDB locking
- Module-based architecture (VPC, EKS, RDS)
- Environment-specific tfvars files
See templates/terraform-aws.tf for an AWS VPC + EKS + RDS example.
GitOps with ArgoCD
ArgoCD watches Git repository and syncs cluster state:
- Automated sync with pruning
- Self-healing (drift detection)
- Retry policies for transient failures
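A minimal Application sketch showing those sync options (repository URL and paths are placeholders; the template below is the complete version):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/app-manifests.git
    targetRevision: main
    path: k8s/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true       # delete resources removed from Git
      selfHeal: true    # revert manual drift in the cluster
    retry:
      limit: 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m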
See templates/argocd-application.yaml.
Secrets Management
Use External Secrets Operator to sync from cloud providers:
- AWS Secrets Manager
- HashiCorp Vault
- Azure Key Vault
- GCP Secret Manager
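A minimal ExternalSecret sketch, assuming the operator is installed and a ClusterSecretStore named aws-secrets-manager exists (all names and keys are illustrative):

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-secrets
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: ClusterSecretStore
  target:
    name: app-secrets          # Kubernetes Secret created by the operator
    creationPolicy: Owner
  data:
    - secretKey: DATABASE_URL
      remoteRef:
        key: prod/app/database-url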
See templates/external-secrets.yaml.
Deployment Checklist
Pre-Deployment
- All tests passing in CI
- Security scans clean
- Database migrations ready
- Rollback plan documented
During Deployment
- Monitor deployment progress
- Watch error rates
- Verify health checks passing
Post-Deployment
- Verify metrics normal
- Check logs for errors
- Update status page
Helm Chart Structure
charts/app/
├── Chart.yaml
├── values.yaml
├── templates/
│   ├── deployment.yaml
│   ├── service.yaml
│   ├── ingress.yaml
│   ├── configmap.yaml
│   ├── secret.yaml
│   ├── hpa.yaml
│   └── _helpers.tpl
└── values/
    ├── staging.yaml
    └── production.yaml
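A values.yaml sketch matching this layout (keys are illustrative; values/staging.yaml and values/production.yaml override them per environment):

replicaCount: 3
image:
  repository: registry.example.com/app
  tag: "1.0.0"
  pullPolicy: IfNotPresent
ingress:
  enabled: true
  host: app.example.com
  tls: true
resources:
  requests:
    cpu: 250m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi
autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70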
CI/CD Pipeline Patterns
Branch Strategy
Recommended: Git Flow with Feature Branches
main (production) ─────●────────●──────▶
                       ┃        ┃
dev (staging)     ─────●───●────●───●──────▶
                           ┃        ┃
feature/*         ─────────●────────┘
                                    ▲
                                    └─ PR required, CI checks, code review
Branch protection rules:
- main: Require PR + 2 approvals + all checks pass
- dev: Require PR + 1 approval + all checks pass
- Feature branches: No direct commits to main/dev
GitHub Actions Caching Strategy
- name: Cache Dependencies
  uses: actions/cache@v3
  with:
    path: |
      ~/.npm
      node_modules
      backend/.venv
    key: ${{ runner.os }}-deps-${{ hashFiles('**/package-lock.json', '**/poetry.lock') }}
    restore-keys: |
      ${{ runner.os }}-deps-
Cache hit ratio impact:
- Without cache: 2-3 min install time
- With cache: 10-20 sec install time
- ~85% time savings on typical workflows
Artifact Management
# Build and upload artifact
- name: Build Application
  run: npm run build

- name: Upload Build Artifact
  uses: actions/upload-artifact@v3
  with:
    name: build-${{ github.sha }}
    path: dist/
    retention-days: 7

# Download in deployment job
- name: Download Build Artifact
  uses: actions/download-artifact@v3
  with:
    name: build-${{ github.sha }}
    path: dist/
Benefits:
- Avoid rebuilding in deployment job
- Deploy exact tested artifact (byte-for-byte match)
- Retention policies prevent storage bloat
Matrix Testing
jobs:
  test:
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        node-version: [18, 20, 22]
        os: [ubuntu-latest, windows-latest]
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-node@v3
        with:
          node-version: ${{ matrix.node-version }}
      - run: npm test
Container Optimization Deep Dive
Multi-Stage Build Example
# ============================================================
# Stage 1: Dependencies (builder)
# ============================================================
FROM node:20-alpine AS deps
WORKDIR /app
COPY package*.json ./
RUN npm ci --only=production && npm cache clean --force
# ============================================================
# Stage 2: Build (with dev dependencies)
# ============================================================
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci # Include dev dependencies
COPY . .
RUN npm run build && npm run test
# ============================================================
# Stage 3: Production runtime (minimal)
# ============================================================
FROM node:20-alpine AS runner
WORKDIR /app
# Security: Non-root user
RUN addgroup -g 1001 -S nodejs && adduser -S nodejs -u 1001
# Copy only production dependencies and built artifacts
COPY --from=deps --chown=nodejs:nodejs /app/node_modules ./node_modules
COPY --from=builder --chown=nodejs:nodejs /app/dist ./dist
COPY --chown=nodejs:nodejs package*.json ./
USER nodejs
EXPOSE 3000
ENV NODE_ENV=production
HEALTHCHECK --interval=30s --timeout=3s CMD node healthcheck.js || exit 1
CMD ["node", "dist/main.js"]
Image size comparison:
- Single-stage: 850 MB (includes dev dependencies, source files)
- Multi-stage: 180 MB (only runtime + production deps)
- 78% reduction
Layer Caching Optimization
Order matters for cache efficiency:
# ❌ BAD: Invalidates cache on any code change
COPY . .
RUN npm install
# ✅ GOOD: Cache package.json layer separately
COPY package*.json ./
RUN npm ci # Cached unless package.json changes
COPY . . # Source changes don't invalidate npm install
Security Scanning with Trivy
- name: Build Docker Image
  run: docker build -t myapp:${{ github.sha }} .

- name: Scan for Vulnerabilities
  uses: aquasecurity/trivy-action@master
  with:
    image-ref: 'myapp:${{ github.sha }}'
    format: 'sarif'
    output: 'trivy-results.sarif'
    severity: 'CRITICAL,HIGH'

- name: Upload Scan Results
  uses: github/codeql-action/upload-sarif@v2
  with:
    sarif_file: 'trivy-results.sarif'

- name: Fail on Critical Vulnerabilities
  run: |
    trivy image --severity CRITICAL --exit-code 1 myapp:${{ github.sha }}
Kubernetes Production Patterns
Health Probes
Three probe types with distinct purposes:
spec:
  containers:
    - name: app
      # Startup probe (gives slow-starting apps time to boot)
      startupProbe:
        httpGet:
          path: /health/startup
          port: 8080
        initialDelaySeconds: 0
        periodSeconds: 5
        failureThreshold: 30   # 30 * 5s = 150s max startup time
      # Liveness probe (restarts pod if failing)
      livenessProbe:
        httpGet:
          path: /health/liveness
          port: 8080
        initialDelaySeconds: 60
        periodSeconds: 10
        failureThreshold: 3    # 3 failures = restart
      # Readiness probe (removes from service if failing)
      readinessProbe:
        httpGet:
          path: /health/readiness
          port: 8080
        initialDelaySeconds: 10
        periodSeconds: 5
        failureThreshold: 2    # 2 failures = remove from load balancer
Probe implementation:
from fastapi import FastAPI, HTTPException

app = FastAPI()

@app.get("/health/startup")
async def startup_check():
    # Check DB connection established
    if not db.is_connected():
        raise HTTPException(status_code=503, detail="DB not ready")
    return {"status": "ok"}

@app.get("/health/liveness")
async def liveness_check():
    # Basic "is process running" check
    return {"status": "alive"}

@app.get("/health/readiness")
async def readiness_check():
    # Check all dependencies healthy
    if not redis.ping() or not db.health_check():
        raise HTTPException(status_code=503, detail="Dependencies unhealthy")
    return {"status": "ready"}
PodDisruptionBudget
Prevents too many pods from being evicted during node maintenance:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-pdb
spec:
  minAvailable: 2   # Always keep at least 2 pods running
  selector:
    matchLabels:
      app: myapp
Use cases:
- Cluster upgrades (node drains)
- Autoscaler downscaling
- Manual evictions
Resource Quotas
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: production
spec:
  hard:
    requests.cpu: "10"       # Total CPU requests
    requests.memory: 20Gi    # Total memory requests
    limits.cpu: "20"         # Total CPU limits
    limits.memory: 40Gi      # Total memory limits
    pods: "50"               # Max pods
StatefulSets for Databases
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    # Pod spec here
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi
Key differences from Deployment:
- Stable pod names (postgres-0, postgres-1, postgres-2)
- Ordered deployment and scaling
- Persistent storage per pod
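The serviceName field above points at a headless Service, which is what gives each pod its stable DNS name; a minimal sketch assuming the same labels as the StatefulSet:

apiVersion: v1
kind: Service
metadata:
  name: postgres
spec:
  clusterIP: None        # headless: each pod gets a stable DNS record
  selector:
    app: postgres
  ports:
    - port: 5432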
Database Migration Strategies
Zero-Downtime Migration Pattern
Problem: Adding a NOT NULL column breaks old application versions
Solution: 3-phase migration
Phase 1: Add nullable column
-- Migration v1 (deploy with old code still running)
ALTER TABLE users ADD COLUMN email VARCHAR(255);
Phase 2: Deploy new code + backfill
# New code writes to both old and new schema
def create_user(name: str, email: str):
    # Write to new column
    db.execute("INSERT INTO users (name, email) VALUES (%s, %s)", (name, email))

# Backfill existing rows
async def backfill_emails():
    users_without_email = await db.fetch("SELECT id FROM users WHERE email IS NULL")
    for user in users_without_email:
        email = generate_email(user.id)
        await db.execute("UPDATE users SET email = %s WHERE id = %s", (email, user.id))
Phase 3: Add constraint
-- Migration v2 (after backfill complete)
ALTER TABLE users ALTER COLUMN email SET NOT NULL;
Backward/Forward Compatibility
Backward compatible changes (safe):
- ✅ Add nullable column
- ✅ Add table
- ✅ Add index
- ✅ Rename column (with view alias)
Backward incompatible changes (requires 3-phase):
- ❌ Remove column
- ❌ Rename column (no alias)
- ❌ Add NOT NULL column
- ❌ Change column type
Rollback Procedures
# Helm rollback to previous revision
helm rollback myapp 3
# Kubernetes rollback
kubectl rollout undo deployment/myapp
# Database migration rollback (Alembic example)
alembic downgrade -1
Critical: Test rollback procedures regularly!
Observability & Monitoring
Prometheus Metrics Exposition
import time

from fastapi import FastAPI, Request, Response
from prometheus_client import Counter, Histogram, generate_latest

app = FastAPI()

# Define metrics
http_requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)
http_request_duration_seconds = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['method', 'endpoint']
)

@app.middleware("http")
async def prometheus_middleware(request: Request, call_next):
    start_time = time.time()
    response = await call_next(request)
    duration = time.time() - start_time

    # Record metrics
    http_requests_total.labels(
        method=request.method,
        endpoint=request.url.path,
        status=response.status_code
    ).inc()
    http_request_duration_seconds.labels(
        method=request.method,
        endpoint=request.url.path
    ).observe(duration)

    return response

@app.get("/metrics")
async def metrics():
    return Response(content=generate_latest(), media_type="text/plain")
Grafana Dashboard Queries
# Request rate (requests per second)
rate(http_requests_total[5m])
# Error rate (4xx/5xx as percentage)
sum(rate(http_requests_total{status=~"4..|5.."}[5m])) /
sum(rate(http_requests_total[5m])) * 100
# p95 latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Pod CPU usage
sum(rate(container_cpu_usage_seconds_total{pod=~"myapp-.*"}[5m])) by (pod)
Alerting Rules
groups:
  - name: app-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) /
          sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            rate(http_request_duration_seconds_bucket[5m])
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High p95 latency detected"
          description: "p95 latency is {{ $value }}s"
Real-World SkillForge Examples
Example 1: Local Development with Docker Compose
SkillForge's actual docker-compose.yml:
version: '3.8'

services:
  postgres:
    image: pgvector/pgvector:pg16
    environment:
      POSTGRES_USER: skillforge
      POSTGRES_PASSWORD: dev_password
      POSTGRES_DB: skillforge_dev
    ports:
      - "5437:5432"   # Avoid conflict with host postgres
    volumes:
      - pgdata:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U skillforge"]
      interval: 5s
      timeout: 3s
      retries: 5

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    command: redis-server --appendonly yes --maxmemory 512mb --maxmemory-policy allkeys-lru
    volumes:
      - redisdata:/data

  backend:
    build:
      context: ./backend
      dockerfile: Dockerfile.dev
    ports:
      - "8500:8500"
    environment:
      DATABASE_URL: postgresql://skillforge:dev_password@postgres:5432/skillforge_dev
      REDIS_URL: redis://redis:6379
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_started
    volumes:
      - ./backend:/app   # Hot reload

  frontend:
    build:
      context: ./frontend
      dockerfile: Dockerfile.dev
    ports:
      - "5173:5173"
    environment:
      VITE_API_URL: http://localhost:8500
    volumes:
      - ./frontend:/app
      - /app/node_modules   # Avoid overwriting node_modules

volumes:
  pgdata:
  redisdata:
Key patterns:
- Port mapping to avoid host conflicts (5437:5432)
- Health checks before dependent services start
- Volume mounts for hot reload during development
- Named volumes for data persistence
Example 2: GitHub Actions Workflow
SkillForge's backend CI/CD pipeline:
name: Backend CI/CD

on:
  push:
    branches: [main, dev]
    paths: ['backend/**']
  pull_request:
    branches: [main, dev]
    paths: ['backend/**']

jobs:
  lint-and-test:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: backend
    steps:
      - uses: actions/checkout@v3

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Cache Poetry Dependencies
        uses: actions/cache@v3
        with:
          path: ~/.cache/pypoetry
          key: ${{ runner.os }}-poetry-${{ hashFiles('backend/poetry.lock') }}

      - name: Install Poetry
        run: pip install poetry

      - name: Install Dependencies
        run: poetry install

      - name: Run Ruff Format Check
        run: poetry run ruff format --check app/

      - name: Run Ruff Lint
        run: poetry run ruff check app/

      - name: Run Type Check
        run: poetry run mypy app/ --ignore-missing-imports

      - name: Run Tests
        run: poetry run pytest tests/ --cov=app --cov-report=xml

      - name: Upload Coverage
        uses: codecov/codecov-action@v3
        with:
          file: ./backend/coverage.xml

  security-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Run Trivy Scan
        uses: aquasecurity/trivy-action@master
        with:
          scan-type: 'fs'
          scan-ref: 'backend/'
          severity: 'CRITICAL,HIGH'
Key features:
- Path filtering (only run on backend changes)
- Poetry dependency caching
- Comprehensive quality checks (format, lint, type, test)
- Security scanning with Trivy
Example 3: Alembic Database Migrations
SkillForge migration pattern:
# backend/alembic/versions/2024_12_15_add_langfuse_trace_id.py
"""Add Langfuse trace_id to analyses

Revision ID: abc123def456
"""
from alembic import op
import sqlalchemy as sa

# Revision identifiers used by Alembic
revision = 'abc123def456'
down_revision = None  # set to the previous revision ID in a real chain

def upgrade():
    # Add nullable column first (backward compatible)
    op.add_column('analyses',
        sa.Column('langfuse_trace_id', sa.String(255), nullable=True)
    )
    # Index for lookup performance
    op.create_index('idx_analyses_langfuse_trace',
        'analyses', ['langfuse_trace_id']
    )

def downgrade():
    op.drop_index('idx_analyses_langfuse_trace', table_name='analyses')
    op.drop_column('analyses', 'langfuse_trace_id')
Migration workflow:
# Create new migration
poetry run alembic revision --autogenerate -m "Add langfuse trace ID"
# Review generated migration (ALWAYS review!)
cat alembic/versions/abc123_add_langfuse_trace_id.py
# Apply migration
poetry run alembic upgrade head
# Rollback if needed
poetry run alembic downgrade -1
Extended Thinking Triggers
Use Opus 4.5 extended thinking for:
- Architecture decisions - Kubernetes vs serverless, multi-region setup
- Migration planning - Moving between cloud providers
- Incident response - Complex deployment failures
- Security design - Zero-trust architecture
Templates Reference
| Template | Purpose |
|---|---|
| github-actions-pipeline.yml | Full CI/CD workflow with 6 stages |
| Dockerfile | Multi-stage Node.js build |
| docker-compose.yml | Development environment |
| k8s-manifests.yaml | Deployment, Service, Ingress |
| helm-values.yaml | Helm chart values |
| terraform-aws.tf | VPC, EKS, RDS infrastructure |
| argocd-application.yaml | GitOps application |
| external-secrets.yaml | Secrets Manager integration |