| name | DevOps & Deployment |
| description | CI/CD pipelines, containerization, Kubernetes, and infrastructure as code patterns |
| version | 1.0.0 |
| category | Infrastructure & Deployment |
| agents | backend-system-architect, code-quality-reviewer, studio-coach |
| keywords | CI/CD, deployment, Docker, Kubernetes, pipeline, infrastructure, GitOps, container, automation, release |
DevOps & Deployment Skill
Comprehensive frameworks for CI/CD pipelines, containerization, deployment strategies, and infrastructure automation.
When to Use
- Setting up CI/CD pipelines
- Containerizing applications
- Deploying to Kubernetes or cloud platforms
- Implementing GitOps workflows
- Managing infrastructure as code
- Planning release strategies
Pipeline Architecture
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Code │──▶│ Build │──▶│ Test │──▶│ Deploy │
│ Commit │ │ & Lint │ │ & Scan │ │ & Release │
└─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘
│ │ │ │
▼ ▼ ▼ ▼
Triggers Artifacts Reports Monitoring
Key Concepts
CI/CD Pipeline Stages
- Lint & Type Check - Code quality gates
- Unit Tests - Test coverage with reporting
- Security Scan - npm audit + Trivy vulnerability scanner
- Build & Push - Docker image to container registry
- Deploy Staging - Environment-gated deployment
- Deploy Production - Manual approval or automated
See `templates/github-actions-pipeline.yml` for the complete GitHub Actions workflow.
Container Best Practices
Multi-stage builds minimize image size:
- Stage 1: Install production dependencies only
- Stage 2: Build application with dev dependencies
- Stage 3: Production runtime with minimal footprint
Security hardening:
- Non-root user (uid 1001)
- Read-only filesystem where possible
- Health checks for orchestrator integration
See `templates/Dockerfile` and `templates/docker-compose.yml`.
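At runtime, the same hardening can be enforced with plain Docker flags; a minimal sketch, assuming the image (the name `myapp:latest` is a placeholder) tolerates a read-only root filesystem:
# Run as the non-root uid from the Dockerfile, with a read-only root filesystem
# and all Linux capabilities dropped; --tmpfs provides writable scratch space
docker run -d --name myapp \
  --user 1001:1001 \
  --read-only \
  --tmpfs /tmp \
  --cap-drop ALL \
  -p 3000:3000 \
  myapp:latest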
Kubernetes Deployment
Essential manifests:
- Deployment with rolling update strategy
- Service for internal routing
- Ingress for external access with TLS
- HorizontalPodAutoscaler for scaling
Security context:
- `runAsNonRoot: true`
- `allowPrivilegeEscalation: false`
- `readOnlyRootFilesystem: true`
- Drop all capabilities
Resource management:
- Always set requests and limits
- Use `requests` for scheduling and `limits` for throttling (see the kubectl example below)
See `templates/k8s-manifests.yaml` and `templates/helm-values.yaml`.
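Requests and limits can also be applied or corrected on a live Deployment with kubectl; a sketch with a placeholder deployment name and values:
# requests drive scheduling decisions; limits cap CPU (throttling) and memory (OOM kill)
kubectl set resources deployment/myapp \
  --requests=cpu=250m,memory=256Mi \
  --limits=cpu=500m,memory=512Mi
# Confirm what is now set on the pod template
kubectl get deployment myapp -o jsonpath='{.spec.template.spec.containers[0].resources}'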
Deployment Strategies
| Strategy | Use Case | Risk |
|---|---|---|
| Rolling | Default, gradual replacement | Low - automatic rollback |
| Blue-Green | Instant switch, easy rollback | Medium - double resources |
| Canary | Progressive traffic shift | Low - gradual exposure |
Rolling Update (Kubernetes default):
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 25%
maxUnavailable: 0 # Zero downtime
Blue-Green: Deploy to the standby environment, then switch the Service selector (example below).
Canary: Use an Istio VirtualService for progressive traffic shifting (10% → 50% → 100%).
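A rough sketch of the blue-green cutover via the Service selector (the `version` label and names are illustrative, not part of the templates):
# Traffic currently goes to pods labelled version=blue; point the Service at green
kubectl patch service myapp \
  -p '{"spec":{"selector":{"app":"myapp","version":"green"}}}'
# Instant rollback: point the selector back at blue
kubectl patch service myapp \
  -p '{"spec":{"selector":{"app":"myapp","version":"blue"}}}'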
Infrastructure as Code
Terraform patterns:
- Remote state in S3 with DynamoDB locking
- Module-based architecture (VPC, EKS, RDS)
- Environment-specific tfvars files
See `templates/terraform-aws.tf` for an AWS VPC + EKS + RDS example.
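A typical day-to-day workflow against that layout might look like this (the `environments/staging.tfvars` path is an assumption; adjust to the repo's structure):
# init configures the remote S3 backend with DynamoDB state locking
terraform init
# Plan and apply with environment-specific variables
terraform plan -var-file=environments/staging.tfvars -out=staging.plan
terraform apply staging.plan
# Inspect resources tracked in the remote state
terraform state list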
GitOps with ArgoCD
ArgoCD watches Git repository and syncs cluster state:
- Automated sync with pruning
- Self-healing (drift detection)
- Retry policies for transient failures
See `templates/argocd-application.yaml`.
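For manual inspection alongside automated sync, the ArgoCD CLI can be used; `myapp` is a placeholder application name:
# Show sync status, health, and any drift ArgoCD has detected
argocd app get myapp
# Show the diff between Git (desired) and the live cluster state
argocd app diff myapp
# Trigger a manual sync with pruning of resources removed from Git
argocd app sync myapp --prune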
Secrets Management
Use External Secrets Operator to sync from cloud providers:
- AWS Secrets Manager
- HashiCorp Vault
- Azure Key Vault
- GCP Secret Manager
See `templates/external-secrets.yaml`.
Deployment Checklist
Pre-Deployment
- All tests passing in CI
- Security scans clean
- Database migrations ready
- Rollback plan documented
During Deployment
- Monitor deployment progress
- Watch error rates
- Verify health checks passing
Post-Deployment
- Verify metrics normal
- Check logs for errors
- Update status page
Helm Chart Structure
charts/app/
├── Chart.yaml
├── values.yaml
├── templates/
│ ├── deployment.yaml
│ ├── service.yaml
│ ├── ingress.yaml
│ ├── configmap.yaml
│ ├── secret.yaml
│ ├── hpa.yaml
│ └── _helpers.tpl
└── values/
├── staging.yaml
└── production.yaml
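Environment files under `values/` are layered on top of the chart defaults at install time; for example:
# values.yaml is applied first, then the environment overrides
helm upgrade --install myapp charts/app \
  -f charts/app/values/staging.yaml \
  --namespace staging --create-namespace
# Render manifests locally to review a production rollout before applying it
helm template myapp charts/app -f charts/app/values/production.yaml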
CI/CD Pipeline Patterns
Branch Strategy
Recommended: Git Flow with Feature Branches
main (production) ─────●────────●──────▶
┃ ┃
dev (staging) ─────●───●────●───●──────▶
┃ ┃
feature/* ─────────●────────┘
▲
└─ PR required, CI checks, code review
Branch protection rules:
- `main`: Require PR + 2 approvals + all checks pass
- `dev`: Require PR + 1 approval + all checks pass
- Feature branches: No direct commits to `main`/`dev`
GitHub Actions Caching Strategy
- name: Cache Dependencies
uses: actions/cache@v4
with:
path: |
~/.npm
node_modules
backend/.venv
key: ${{ runner.os }}-deps-${{ hashFiles('**/package-lock.json', '**/poetry.lock') }}
restore-keys: |
${{ runner.os }}-deps-
Cache hit ratio impact:
- Without cache: 2-3 min install time
- With cache: 10-20 sec install time
- ~85% time savings on typical workflows
Artifact Management
# Build and upload artifact
- name: Build Application
run: npm run build
- name: Upload Build Artifact
uses: actions/upload-artifact@v4
with:
name: build-${{ github.sha }}
path: dist/
retention-days: 7
# Download in deployment job
- name: Download Build Artifact
uses: actions/download-artifact@v4
with:
name: build-${{ github.sha }}
path: dist/
Benefits:
- Avoid rebuilding in deployment job
- Deploy exact tested artifact (byte-for-byte match)
- Retention policies prevent storage bloat
Matrix Testing
strategy:
matrix:
node-version: [18, 20, 22]
os: [ubuntu-latest, windows-latest]
jobs:
test:
runs-on: ${{ matrix.os }}
steps:
- uses: actions/setup-node@v4
with:
node-version: ${{ matrix.node-version }}
- run: npm test
Container Optimization Deep Dive
Multi-Stage Build Example
# ============================================================
# Stage 1: Dependencies (builder)
# ============================================================
FROM node:20-alpine AS deps
WORKDIR /app
COPY package*.json ./
RUN npm ci --only=production && npm cache clean --force
# ============================================================
# Stage 2: Build (with dev dependencies)
# ============================================================
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci # Include dev dependencies
COPY . .
RUN npm run build && npm run test
# ============================================================
# Stage 3: Production runtime (minimal)
# ============================================================
FROM node:20-alpine AS runner
WORKDIR /app
# Security: Non-root user
RUN addgroup -g 1001 -S nodejs && adduser -S nodejs -u 1001
# Copy only production dependencies and built artifacts
COPY --from=deps --chown=nodejs:nodejs /app/node_modules ./node_modules
COPY --from=builder --chown=nodejs:nodejs /app/dist ./dist
COPY --chown=nodejs:nodejs package*.json ./
COPY --chown=nodejs:nodejs healthcheck.js ./  # needed by the HEALTHCHECK below
USER nodejs
EXPOSE 3000
ENV NODE_ENV=production
HEALTHCHECK --interval=30s --timeout=3s CMD node healthcheck.js || exit 1
CMD ["node", "dist/main.js"]
Image size comparison:
- Single-stage: 850 MB (includes dev dependencies, source files)
- Multi-stage: 180 MB (only runtime + production deps)
- 78% reduction
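The stage names above can be targeted directly to verify the savings locally; a quick sketch:
# Build just the final runtime stage, then compare against the builder stage
docker build --target runner -t myapp:prod .
docker build --target builder -t myapp:build .
docker images myapp          # SIZE column shows the difference
docker history myapp:prod    # per-layer breakdown of the runtime image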
Layer Caching Optimization
Order matters for cache efficiency:
# ❌ BAD: Invalidates cache on any code change
COPY . .
RUN npm install
# ✅ GOOD: Cache package.json layer separately
COPY package*.json ./
RUN npm ci # Cached unless package.json changes
COPY . . # Source changes don't invalidate npm install
Security Scanning with Trivy
- name: Build Docker Image
run: docker build -t myapp:${{ github.sha }} .
- name: Scan for Vulnerabilities
uses: aquasecurity/trivy-action@master
with:
image-ref: 'myapp:${{ github.sha }}'
format: 'sarif'
output: 'trivy-results.sarif'
severity: 'CRITICAL,HIGH'
- name: Upload Scan Results
uses: github/codeql-action/upload-sarif@v3
with:
sarif_file: 'trivy-results.sarif'
- name: Fail on Critical Vulnerabilities
run: |
trivy image --severity CRITICAL --exit-code 1 myapp:${{ github.sha }}
Kubernetes Production Patterns
Health Probes
Three probe types with distinct purposes:
spec:
containers:
- name: app
# Startup probe (gives slow-starting apps time to boot)
startupProbe:
httpGet:
path: /health/startup
port: 8080
initialDelaySeconds: 0
periodSeconds: 5
failureThreshold: 30 # 30 * 5s = 150s max startup time
# Liveness probe (restarts pod if failing)
livenessProbe:
httpGet:
path: /health/liveness
port: 8080
initialDelaySeconds: 60
periodSeconds: 10
failureThreshold: 3 # 3 failures = restart
# Readiness probe (removes from service if failing)
readinessProbe:
httpGet:
path: /health/readiness
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
failureThreshold: 2 # 2 failures = remove from load balancer
Probe implementation:
@app.get("/health/startup")
async def startup_check():
# Check DB connection established
if not db.is_connected():
raise HTTPException(status_code=503, detail="DB not ready")
return {"status": "ok"}
@app.get("/health/liveness")
async def liveness_check():
# Basic "is process running" check
return {"status": "alive"}
@app.get("/health/readiness")
async def readiness_check():
# Check all dependencies healthy
if not redis.ping() or not db.health_check():
raise HTTPException(status_code=503, detail="Dependencies unhealthy")
return {"status": "ready"}
PodDisruptionBudget
Prevents too many pods from being evicted during node maintenance:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: app-pdb
spec:
minAvailable: 2 # Always keep at least 2 pods running
selector:
matchLabels:
app: myapp
Use cases:
- Cluster upgrades (node drains)
- Autoscaler downscaling
- Manual evictions
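During a node drain, for example, evictions that would drop the app below `minAvailable` are blocked and retried (the node name is a placeholder):
# Mark the node unschedulable, then evict its pods; the PDB throttles evictions
kubectl cordon node-1
kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data
# See how much disruption the budget currently allows
kubectl get pdb app-pdb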
Resource Quotas
apiVersion: v1
kind: ResourceQuota
metadata:
name: team-quota
namespace: production
spec:
hard:
requests.cpu: "10" # Total CPU requests
requests.memory: 20Gi # Total memory requests
limits.cpu: "20" # Total CPU limits
limits.memory: 40Gi # Total memory limits
pods: "50" # Max pods
StatefulSets for Databases
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: postgres
spec:
serviceName: postgres
replicas: 3
selector:
matchLabels:
app: postgres
template:
# Pod spec here
volumeClaimTemplates:
- metadata:
name: data
spec:
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 100Gi
Key differences from Deployment:
- Stable pod names (`postgres-0`, `postgres-1`, `postgres-2`)
- Ordered deployment and scaling (see the scaling example below)
- Persistent storage per pod
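Scaling makes the ordered behaviour visible; a small sketch against the StatefulSet above:
# Scale up: postgres-3 is created and must be Ready before postgres-4 starts
kubectl scale statefulset postgres --replicas=5
kubectl get pods -l app=postgres -w
# Scale down removes the highest ordinals first; their PVCs are retained by default
kubectl scale statefulset postgres --replicas=3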
Database Migration Strategies
Zero-Downtime Migration Pattern
Problem: Adding a NOT NULL column breaks old application versions
Solution: 3-phase migration
Phase 1: Add nullable column
-- Migration v1 (deploy with old code still running)
ALTER TABLE users ADD COLUMN email VARCHAR(255);
Phase 2: Deploy new code + backfill
# New code writes to both old and new schema
def create_user(name: str, email: str):
# Write to new column
db.execute("INSERT INTO users (name, email) VALUES (%s, %s)", (name, email))
# Backfill existing rows
async def backfill_emails():
users_without_email = await db.fetch("SELECT id FROM users WHERE email IS NULL")
for user in users_without_email:
email = generate_email(user.id)
await db.execute("UPDATE users SET email = %s WHERE id = %s", (email, user.id))
Phase 3: Add constraint
-- Migration v2 (after backfill complete)
ALTER TABLE users ALTER COLUMN email SET NOT NULL;
Backward/Forward Compatibility
Backward compatible changes (safe):
- ✅ Add nullable column
- ✅ Add table
- ✅ Add index
- ✅ Rename column (with view alias)
Backward incompatible changes (requires 3-phase):
- ❌ Remove column
- ❌ Rename column (no alias)
- ❌ Add NOT NULL column
- ❌ Change column type
Rollback Procedures
# Helm rollback to previous revision
helm rollback myapp 3
# Kubernetes rollback
kubectl rollout undo deployment/myapp
# Database migration rollback (Alembic example)
alembic downgrade -1
Critical: Test rollback procedures regularly!
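A few verification commands worth pairing with those rollbacks (names match the examples above):
# Review revision history before choosing a rollback target
kubectl rollout history deployment/myapp
# Confirm the rollback completed and pods are healthy
kubectl rollout status deployment/myapp --timeout=120s
# Helm equivalent of the revision history
helm history myapp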
Observability & Monitoring
Prometheus Metrics Exposition
import time
from fastapi import FastAPI, Request, Response
from prometheus_client import Counter, Histogram, generate_latest
app = FastAPI()
# Define metrics
http_requests_total = Counter(
'http_requests_total',
'Total HTTP requests',
['method', 'endpoint', 'status']
)
http_request_duration_seconds = Histogram(
'http_request_duration_seconds',
'HTTP request duration',
['method', 'endpoint']
)
@app.middleware("http")
async def prometheus_middleware(request: Request, call_next):
start_time = time.time()
response = await call_next(request)
duration = time.time() - start_time
# Record metrics
http_requests_total.labels(
method=request.method,
endpoint=request.url.path,
status=response.status_code
).inc()
http_request_duration_seconds.labels(
method=request.method,
endpoint=request.url.path
).observe(duration)
return response
@app.get("/metrics")
async def metrics():
return Response(content=generate_latest(), media_type="text/plain")
Grafana Dashboard Queries
# Request rate (requests per second)
rate(http_requests_total[5m])
# Error rate (4xx/5xx as percentage)
sum(rate(http_requests_total{status=~"4..|5.."}[5m])) /
sum(rate(http_requests_total[5m])) * 100
# p95 latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Pod CPU usage
sum(rate(container_cpu_usage_seconds_total{pod=~"myapp-.*"}[5m])) by (pod)
Alerting Rules
groups:
- name: app-alerts
rules:
- alert: HighErrorRate
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "High error rate detected"
description: "Error rate is {{ $value | humanizePercentage }}"
- alert: HighLatency
expr: |
histogram_quantile(0.95,
rate(http_request_duration_seconds_bucket[5m])
) > 2
for: 5m
labels:
severity: warning
annotations:
summary: "High p95 latency detected"
description: "p95 latency is {{ $value }}s"
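Assuming the rules above are saved to a file such as `alert-rules.yml`, they can be validated in CI with promtool before Prometheus loads them:
# Syntax-check alerting/recording rules
promtool check rules alert-rules.yml
# Optionally evaluate an expression against a running Prometheus instance
promtool query instant http://localhost:9090 'sum(rate(http_requests_total[5m]))'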
Real-World Examples
Example 1: E-commerce Platform with Docker Compose
Full-stack e-commerce development environment:
version: '3.8'
services:
postgres:
image: postgres:16-alpine
environment:
POSTGRES_USER: shopuser
POSTGRES_PASSWORD: dev_password
POSTGRES_DB: shop_dev
ports:
- "5437:5432" # Avoid conflict with host postgres
volumes:
- pgdata:/var/lib/postgresql/data
healthcheck:
test: ["CMD-SHELL", "pg_isready -U shopuser"]
interval: 5s
timeout: 3s
retries: 5
redis:
image: redis:7-alpine
ports:
- "6379:6379"
command: redis-server --appendonly yes --maxmemory 512mb --maxmemory-policy allkeys-lru
volumes:
- redisdata:/data
healthcheck:
test: ["CMD", "redis-cli", "ping"]
interval: 5s
backend:
build:
context: ./backend
dockerfile: Dockerfile.dev
ports:
- "8000:8000"
environment:
DATABASE_URL: postgresql://shopuser:dev_password@postgres:5432/shop_dev
REDIS_URL: redis://redis:6379
STRIPE_SECRET_KEY: ${STRIPE_SECRET_KEY}
depends_on:
postgres:
condition: service_healthy
redis:
condition: service_healthy
volumes:
- ./backend:/app # Hot reload
- /app/.venv # Persist virtualenv
frontend:
build:
context: ./frontend
dockerfile: Dockerfile.dev
ports:
- "3000:3000"
environment:
VITE_API_URL: http://localhost:8000
VITE_STRIPE_PUBLIC_KEY: ${VITE_STRIPE_PUBLIC_KEY}
volumes:
- ./frontend:/app
- /app/node_modules # Avoid overwriting node_modules
worker:
build:
context: ./backend
dockerfile: Dockerfile.dev
command: celery -A app.worker worker --loglevel=info
environment:
DATABASE_URL: postgresql://shopuser:dev_password@postgres:5432/shop_dev
REDIS_URL: redis://redis:6379
depends_on:
- postgres
- redis
volumes:
- ./backend:/app
volumes:
pgdata:
redisdata:
Key patterns:
- Port mapping to avoid host conflicts (5437:5432)
- Health checks before dependent services start
- Volume mounts for hot reload during development
- Named volumes for data persistence
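Day-to-day commands for this stack (service names match the compose file above; the migration command is illustrative):
# Start everything in the background, rebuilding images when Dockerfiles change
docker compose up -d --build
# Tail logs for one service
docker compose logs -f backend
# Run a one-off command inside the backend container, e.g. migrations
docker compose exec backend alembic upgrade head
# Stop containers but keep named volumes (add -v to wipe pgdata/redisdata)
docker compose down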
Example 2: GitHub Actions CI/CD Pipeline (2025)
Modern FastAPI backend with Ruff and uv:
name: Backend CI/CD
on:
push:
branches: [main, dev]
paths: ['backend/**']
pull_request:
branches: [main, dev]
paths: ['backend/**']
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
jobs:
lint-and-test:
runs-on: ubuntu-24.04 # Latest Ubuntu LTS runner image
defaults:
run:
working-directory: backend
services:
postgres:
image: postgres:16-alpine
env:
POSTGRES_PASSWORD: test
POSTGRES_DB: test_db
ports: ["5432:5432"]
options: >-
--health-cmd pg_isready
--health-interval 10s
--health-timeout 5s
--health-retries 5
redis:
image: redis:7-alpine
ports: ["6379:6379"]
options: >-
--health-cmd "redis-cli ping"
--health-interval 10s
steps:
- uses: actions/checkout@v4
- name: Setup Python 3.12
uses: actions/setup-python@v5
with:
python-version: '3.12'
- name: Install uv
uses: astral-sh/setup-uv@v4
with:
version: "latest"
- name: Cache dependencies
uses: actions/cache@v4
with:
path: ~/.cache/uv
key: ${{ runner.os }}-uv-${{ hashFiles('backend/uv.lock') }}
- name: Install dependencies
run: uv sync --frozen # dev group included so ruff, mypy, and pytest are available below
- name: Lint - Ruff format check
run: uv run ruff format --check app/
- name: Lint - Ruff code check
run: uv run ruff check app/
- name: Type check with mypy
run: uv run mypy app/ --strict --ignore-missing-imports
- name: Run tests with coverage
env:
DATABASE_URL: postgresql://postgres:test@localhost:5432/test_db
REDIS_URL: redis://localhost:6379
run: uv run pytest tests/ --cov=app --cov-report=xml --cov-report=term -v
- name: Upload coverage to Codecov
uses: codecov/codecov-action@v4
with:
file: ./backend/coverage.xml
flags: backend
token: ${{ secrets.CODECOV_TOKEN }}
security-scan:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Run Trivy vulnerability scanner
uses: aquasecurity/trivy-action@master
with:
scan-type: 'fs'
scan-ref: 'backend/'
severity: 'CRITICAL,HIGH'
format: 'sarif'
output: 'trivy-results.sarif'
- name: Upload Trivy results to GitHub Security
uses: github/codeql-action/upload-sarif@v3
if: always()
with:
sarif_file: 'trivy-results.sarif'
docker-build:
runs-on: ubuntu-latest
needs: [lint-and-test, security-scan]
if: github.ref == 'refs/heads/main'
steps:
- uses: actions/checkout@v4
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Login to GitHub Container Registry
uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- name: Build and push Docker image
uses: docker/build-push-action@v5
with:
context: ./backend
push: true
tags: |
ghcr.io/${{ github.repository }}/backend:latest
ghcr.io/${{ github.repository }}/backend:${{ github.sha }}
cache-from: type=gha
cache-to: type=gha,mode=max
Key features:
- Latest GitHub Actions runners (ubuntu-24.04)
- uv for fast Python package management
- Service containers for integration tests
- Concurrency control to cancel outdated runs
- Docker Buildx with GitHub Actions cache
- SARIF upload for security findings
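The GitHub CLI is convenient for watching and re-running this workflow from a terminal (the workflow name matches the `name:` field above; the run id is a placeholder):
# List recent runs of this workflow and follow the most recent one
gh run list --workflow "Backend CI/CD" --limit 5
gh run watch
# Re-run only the failed jobs of a specific run
gh run rerun <run-id> --failed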
Example 3: Database Migration with Alembic (2025)
E-commerce order tracking migration:
# backend/alembic/versions/2025_12_29_add_order_tracking.py
"""Add order tracking with shipment provider
Revision ID: a1b2c3d4e5f6
Revises: prev_revision_id
Create Date: 2025-12-29 10:30:00
"""
from alembic import op
import sqlalchemy as sa
from sqlalchemy.dialects import postgresql
# Revision identifiers
revision = 'a1b2c3d4e5f6'
down_revision = 'prev_revision_id'
branch_labels = None
depends_on = None
def upgrade():
# Add tracking column as nullable first (backward compatible)
op.add_column('orders',
sa.Column('tracking_number', sa.String(100), nullable=True)
)
op.add_column('orders',
sa.Column('shipping_provider', sa.String(50), nullable=True)
)
op.add_column('orders',
sa.Column('shipped_at', sa.DateTime(timezone=True), nullable=True)
)
# Add composite index for tracking lookups
op.create_index(
'idx_orders_tracking',
'orders',
['tracking_number', 'shipping_provider'],
unique=False
)
# Add check constraint for valid providers
op.create_check_constraint(
'ck_orders_shipping_provider',
'orders',
"shipping_provider IN ('USPS', 'UPS', 'FedEx', 'DHL')"
)
def downgrade():
op.drop_constraint('ck_orders_shipping_provider', 'orders', type_='check')
op.drop_index('idx_orders_tracking', 'orders')
op.drop_column('orders', 'shipped_at')
op.drop_column('orders', 'shipping_provider')
op.drop_column('orders', 'tracking_number')
Migration workflow with uv (2025):
# Create new migration (auto-generate from models)
uv run alembic revision --autogenerate -m "Add order tracking with shipment provider"
# ALWAYS review the generated migration!
cat alembic/versions/*_add_order_tracking.py
# Test migration in development
uv run alembic upgrade head
# Verify schema changes
uv run alembic current
psql $DATABASE_URL -c "\d orders"
# Rollback if issues found
uv run alembic downgrade -1
# Check migration history
uv run alembic history --verbose
Production migration checklist:
# 1. Backup database
pg_dump $DATABASE_URL > backup_$(date +%Y%m%d_%H%M%S).sql
# 2. Test migration on staging with production data snapshot
uv run alembic upgrade head
# 3. Monitor query performance
psql $DATABASE_URL -c "EXPLAIN ANALYZE SELECT * FROM orders WHERE tracking_number = 'TRACK123';"
# 4. Deploy with zero downtime (nullable columns)
# Phase 1: Add nullable columns
# Phase 2: Backfill data
# Phase 3: Add NOT NULL constraint (if needed)
Extended Thinking Triggers
Use Opus 4.5 extended thinking for:
- Architecture decisions - Kubernetes vs serverless, multi-region setup
- Migration planning - Moving between cloud providers
- Incident response - Complex deployment failures
- Security design - Zero-trust architecture
Templates Reference
| Template | Purpose |
|---|---|
| `github-actions-pipeline.yml` | Full CI/CD workflow with 6 stages |
| `Dockerfile` | Multi-stage Node.js build |
| `docker-compose.yml` | Development environment |
| `k8s-manifests.yaml` | Deployment, Service, Ingress |
| `helm-values.yaml` | Helm chart values |
| `terraform-aws.tf` | VPC, EKS, RDS infrastructure |
| `argocd-application.yaml` | GitOps application |
| `external-secrets.yaml` | Secrets Manager integration |