name	devops-excellence
description	DevOps and CI/CD expert. Use when setting up pipelines, containerizing applications, deploying to Kubernetes, or implementing release strategies. Covers GitHub Actions, Docker, K8s, Terraform, and GitOps.

DevOps Excellence

Core Principles

Shift Left — Address security and quality early in SDLC
GitOps — Git as single source of truth for infrastructure and deployments
Infrastructure as Code — All infrastructure versioned and reproducible
Progressive Delivery — Gradual rollouts with feature flags and canary releases
Immutable Infrastructure — Replace, don't modify running systems
Observability-First — Monitor metrics tied to deployments and features
Policy as Code — Enforce compliance and security automatically
Platform Engineering — Build golden paths and self-service portals

Hard Rules (Must Follow)

These rules are mandatory. Violating them means the skill is not working correctly.

No Static Credentials

Never use long-lived static credentials. Always use OIDC or short-lived tokens.

# ❌ FORBIDDEN: Static AWS credentials
env:
  AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
  AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}

# ✅ REQUIRED: OIDC-based authentication
- name: Configure AWS Credentials
  uses: aws-actions/configure-aws-credentials@v4
  with:
    role-to-assume: arn:aws:iam::123456789012:role/GitHubActions
    aws-region: us-east-1
    # No long-lived secrets - uses GitHub OIDC provider

No Root Containers

Containers must NEVER run as root. Always specify a non-root user.

# ❌ FORBIDDEN: Running as root (default)
FROM node:20
WORKDIR /app
CMD ["node", "server.js"]

# ❌ FORBIDDEN: Explicit root user
USER root

# ✅ REQUIRED: Non-root user with UID > 1000
FROM node:20-alpine
RUN addgroup -g 1001 -S nodejs && \
    adduser -S nodejs -u 1001
USER nodejs
WORKDIR /app
CMD ["node", "server.js"]

No Secrets in Images

Never bake secrets into Docker images. Use runtime injection or secrets managers.

# ❌ FORBIDDEN: Secrets in build args or ENV
ARG DATABASE_PASSWORD
ENV API_KEY=sk-xxx

# ❌ FORBIDDEN: Copying secret files
COPY .env /app/.env
COPY credentials.json /app/

# ✅ REQUIRED: Mount secrets at runtime
# docker run -v /secrets:/app/secrets:ro myapp
# Or use Kubernetes secrets/configmaps

Protected Production Deployments

Production deployments must require approval and be restricted to main branch.

# ❌ FORBIDDEN: Direct production deploy without protection
deploy:
  runs-on: ubuntu-latest
  steps:
    - run: deploy-to-prod.sh

# ✅ REQUIRED: Environment protection
deploy:
  runs-on: ubuntu-latest
  environment:
    name: production
    url: https://myapp.com
  # Requires: approval + main branch only

Quick Reference

When to Use What

Scenario	Tool/Pattern	Reason
Public GitHub project	GitHub Actions	Native integration, free for public repos
Enterprise GitLab	GitLab CI	Unified platform, advanced security scanning
Multi-cloud IaC	Terraform	Mature ecosystem, wide provider support
Developer-centric IaC	Pulumi	Real programming languages, better testing
Kubernetes deployments	ArgoCD + Kustomize	GitOps standard, declarative config
Zero-downtime releases	Blue-Green or Canary	Instant rollback capability
Gradual feature rollout	Feature flags (LaunchDarkly)	Progressive delivery with targeting

Deployment Strategy Selection

Strategy	Downtime	Cost	Rollback Speed	Complexity	Best For
Rolling	Minimal	Low	Medium	Low	Regular updates, cost-conscious
Blue-Green	Zero	High (2x)	Instant	Medium	Critical systems, easy rollback
Canary	Zero	Medium	Fast	High	Risk mitigation, data-driven
Recreate	High	Low	N/A	Very Low	Non-critical, dev/test only

CI/CD Pipeline Best Practices

Pipeline Security

# Short-lived credentials (not static keys)
- name: Configure AWS Credentials
  uses: aws-actions/configure-aws-credentials@v4
  with:
    role-to-assume: arn:aws:iam::123456789012:role/GitHubActions
    aws-region: us-east-1
    # OIDC provider - no long-lived secrets!

# Protected environments for production
environment:
  name: production
  # Requires approval + restricts to main branch

Speed Optimization

10-minute build rule — Most projects should build in <10 minutes
Parallel jobs — Run tests, linting, security scans concurrently
Cache dependencies — Cache node_modules, .m2, pip packages
Conditional execution — Skip jobs when files haven't changed

# Example: conditional job execution
jobs:
  backend-tests:
    if: contains(github.event.head_commit.modified, 'backend/')
    runs-on: ubuntu-latest

Testing Pyramid

              /\
             /E2E\        <- Few (slow, expensive)
            /------\
           /Integration\ <- Some (medium speed)
          /------------\
         /  Unit Tests  \ <- Many (fast, cheap)
        /----------------\

70% Unit tests (fast, isolated)
20% Integration tests (service interactions)
10% E2E tests (full user workflows)

Security Scanning Integration

# Multi-layer security scanning
jobs:
  security:
    runs-on: ubuntu-latest
    steps:
      # SAST - Static code analysis
      - uses: github/codeql-action/init@v3

      # SCA - Dependency vulnerabilities
      - name: Run Trivy
        uses: aquasecurity/trivy-action@master
        with:
          scan-type: 'fs'
          format: 'sarif'

      # Secret scanning
      - name: Gitleaks
        uses: gitleaks/gitleaks-action@v2

      # Container scanning
      - name: Scan Docker image
        run: trivy image myapp:${{ github.sha }}

Docker Best Practices

Multi-Stage Builds

# Build stage - includes build tools (900MB+)
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci --only=production

# Runtime stage - minimal image (<100MB)
FROM node:20-alpine AS runtime
RUN addgroup -g 1001 -S nodejs && \
    adduser -S nodejs -u 1001
WORKDIR /app
COPY --from=builder --chown=nodejs:nodejs /app/node_modules ./node_modules
COPY --chown=nodejs:nodejs . .
USER nodejs
EXPOSE 3000
CMD ["node", "server.js"]

Security Hardening

Non-root user — ALWAYS run as non-root (UID 1001)
Minimal base images — Use alpine, distroless, or scratch
Read-only filesystem — docker run --read-only
No secrets in layers — Use build secrets or external vaults
Resource limits — Set CPU/memory limits to prevent DoS
Signed images — Enable Docker Content Trust

# Security best practices example
FROM gcr.io/distroless/nodejs20-debian12
COPY --chown=65532:65532 /app /app
USER 65532
EXPOSE 8080

.dockerignore

# Version control
.git
.gitignore

# Dependencies (install fresh in container)
node_modules
vendor/
*.pyc
__pycache__

# Secrets and configs
.env
.env.local
secrets/
*.key
*.pem

# Development files
README.md
Dockerfile
docker-compose.yml
.vscode/
.idea/

# Testing and CI
tests/
*.test.js
.github/

Kubernetes Deployment Patterns

Resource Management (Right-Sizing)

# 99.94% of clusters are over-provisioned!
# Average CPU usage: 10%, Memory: 23%
resources:
  requests:
    memory: "128Mi"  # Guaranteed allocation
    cpu: "100m"      # 0.1 CPU cores
  limits:
    memory: "256Mi"  # Maximum allowed
    cpu: "200m"      # Hard cap

# Use tools: Kubecost, Goldilocks, VPA

Health Checks

# Liveness: Is container alive?
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3

# Readiness: Can it receive traffic?
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  successThreshold: 1

# Startup: Has initialization completed?
startupProbe:
  httpGet:
    path: /startup
    port: 8080
  failureThreshold: 30  # 30*10s = 5min for slow starts
  periodSeconds: 10

ConfigMaps and Secrets

# Group related resources in single manifest
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  APP_ENV: production
  LOG_LEVEL: info
---
apiVersion: v1
kind: Secret
metadata:
  name: app-secrets
type: Opaque
stringData:
  DATABASE_URL: postgresql://user:pass@db:5432/mydb
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  template:
    spec:
      containers:
      - name: app
        envFrom:
        - configMapRef:
            name: app-config
        - secretRef:
            name: app-secrets

Security Best Practices

# Pod Security Standards
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  fsGroup: 1000
  seccompProfile:
    type: RuntimeDefault
  capabilities:
    drop:
    - ALL

# Network Policies (deny-by-default)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all-ingress
spec:
  podSelector: {}
  policyTypes:
  - Ingress

Infrastructure as Code (Terraform/Pulumi)

Directory Structure

terraform/
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   └── terraform.tfvars
│   ├── staging/
│   └── prod/
├── modules/
│   ├── vpc/
│   ├── eks/
│   └── rds/
├── backend.tf        # Remote state config
└── versions.tf       # Provider versions

Best Practices

1. Remote State with Locking

# backend.tf
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "prod/vpc/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"  # Prevents concurrent runs
  }
}

2. Modularization

# modules/vpc/main.tf
variable "cidr_block" {
  type        = string
  description = "VPC CIDR block"
}

resource "aws_vpc" "main" {
  cidr_block           = var.cidr_block
  enable_dns_hostnames = true
  tags = {
    Name = "${var.environment}-vpc"
  }
}

# environments/prod/main.tf
module "vpc" {
  source     = "../../modules/vpc"
  cidr_block = "10.0.0.0/16"
  environment = "prod"
}

3. Policy as Code

# Use Sentinel (Terraform Cloud) or OPA
policy "enforce-tags" {
  enforcement_level = "hard-mandatory"

  # Require tags on all resources
  rule {
    condition = all resource.tags contains "Owner"
    error_message = "All resources must have Owner tag"
  }
}

4. Automated Testing

// Terratest example
func TestVPCCreation(t *testing.T) {
  terraformOptions := &terraform.Options{
    TerraformDir: "../environments/dev",
  }

  defer terraform.Destroy(t, terraformOptions)
  terraform.InitAndApply(t, terraformOptions)

  vpcId := terraform.Output(t, terraformOptions, "vpc_id")
  assert.NotEmpty(t, vpcId)
}

Pulumi Advantages

// Pulumi - real programming language benefits
import * as aws from "@pulumi/aws";

const environments = ["dev", "staging", "prod"];

// Use loops, conditionals, functions
environments.forEach(env => {
  new aws.ec2.Vpc(`${env}-vpc`, {
    cidrBlock: env === "prod" ? "10.0.0.0/16" : "10.1.0.0/16",
    tags: { Environment: env },
  });
});

// Built-in testing framework
import * as pulumi from "@pulumi/pulumi";
pulumi.runtime.setMocks(...);

Release Strategies

Blue-Green Deployment

# Two identical environments
# Switch traffic instantly via load balancer

# Step 1: Deploy to Green (idle)
# Step 2: Test Green environment
# Step 3: Switch LB from Blue to Green
# Step 4: Keep Blue as rollback option

# Kubernetes example
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
    version: blue  # Change to 'green' to switch
  ports:
  - port: 80

When to use:

Critical systems requiring instant rollback
Compliance requirements for zero downtime
Budget allows 2x infrastructure

Canary Deployment

# Gradual rollout: 5% → 25% → 50% → 100%
# Monitor metrics at each stage

# Argo Rollouts example
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: myapp
spec:
  replicas: 10
  strategy:
    canary:
      steps:
      - setWeight: 10      # 1 pod (10%)
      - pause: {duration: 5m}
      - setWeight: 50      # 5 pods
      - pause: {duration: 10m}
      - setWeight: 100     # All pods
  template:
    spec:
      containers:
      - name: myapp
        image: myapp:v2.0

When to use:

High-risk deployments (major refactors)
User-facing features needing validation
Data-driven rollout decisions

Rolling Update

# Default Kubernetes strategy
# Gradually replace old pods with new

apiVersion: apps/v1
kind: Deployment
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1  # Never < 9 pods available
      maxSurge: 2        # Never > 12 pods total

When to use:

Regular incremental updates
Cost-conscious deployments
Low-risk changes

Feature Flags and Progressive Delivery

Best Practices

1. Flag Lifecycle Management

// Avoid "flag debt" - remove after rollout
const featureFlags = {
  // Short-lived (remove after 100% rollout)
  "new-checkout-v4": {
    enabled: true,
    rollout: 100,
    created: "2025-01-15",
    removeBy: "2025-02-15"
  },

  // Long-lived (kill switch)
  "payment-processing": {
    enabled: true,
    permanent: true,  // Document why
    reason: "Emergency shutoff for payment issues"
  }
};

2. Progressive Rollout

// LaunchDarkly example
const showNewFeature = ldClient.variation(
  "new-dashboard-ui",
  user,
  false  // Default fallback
);

// Configuration
{
  "targeting": {
    "rules": [
      {
        "variation": "on",
        "clauses": [
          {
            "attribute": "email",
            "op": "endsWith",
            "values": ["@mycompany.com"]
          }
        ]
      }
    ],
    "rollout": {
      "percentage": 10  // 10% of remaining users
    }
  }
}

3. Segment Meaningfully

Geographic: Region-specific rollouts
Behavioral: Power users first, then general
Technical: Browser/device-based targeting
Business: Premium tier vs free tier

4. Observability Integration

// Tie metrics to feature flags
metrics.increment('checkout.completed', {
  feature_flag: 'new-checkout-v4',
  enabled: showNewCheckout
});

// Automatic rollback on error spike
if (errorRate > threshold) {
  ldClient.updateFeatureFlag('new-checkout-v4', { enabled: false });
  alerts.critical('Auto-rollback triggered for new-checkout-v4');
}

GitOps Practices

Core Principles

Declarative — Entire system state in Git
Versioned — Git history = audit trail
Immutable — Git commits are immutable
Automatic — Agents auto-sync cluster to Git state
Continuous — Reconciliation loop detects drift

ArgoCD Workflow

# Application definition
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/myorg/k8s-manifests
    targetRevision: main
    path: apps/myapp
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true      # Delete resources not in Git
      selfHeal: true   # Auto-sync on drift detection
    syncOptions:
    - CreateNamespace=true

Repository Structure

k8s-manifests/
├── apps/
│   ├── myapp/
│   │   ├── base/
│   │   │   ├── deployment.yaml
│   │   │   ├── service.yaml
│   │   │   └── kustomization.yaml
│   │   └── overlays/
│   │       ├── dev/
│   │       ├── staging/
│   │       └── prod/
│   │           ├── kustomization.yaml
│   │           └── replicas-patch.yaml
├── infrastructure/
│   ├── ingress-nginx/
│   └── cert-manager/
└── argocd/
    ├── projects.yaml
    └── applications.yaml

Policy Enforcement

# OPA Gatekeeper - deny images without tags
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: require-owner-label
spec:
  match:
    kinds:
    - apiGroups: ["apps"]
      kinds: ["Deployment"]
  parameters:
    labels: ["owner", "environment"]

Platform Engineering

Internal Developer Portal (Backstage)

# Software catalog
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: order-service
  description: Order processing microservice
  tags:
    - java
    - spring-boot
  annotations:
    github.com/project-slug: myorg/order-service
    pagerduty.com/integration-key: xyz
spec:
  type: service
  lifecycle: production
  owner: team-orders
  system: ecommerce-platform

Golden Paths (Templates)

# Self-service project scaffolding
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: nodejs-service
  title: Node.js Microservice
spec:
  steps:
  - id: fetch-template
    action: fetch:template
    input:
      url: ./skeleton
  - id: create-repo
    action: github:repo:create
  - id: setup-pipeline
    action: github:actions:create
  - id: provision-k8s
    action: argocd:create-app

Benefits

Setup time — Days to minutes (40% reduction in tickets)
Consistency — Standardized patterns across teams
Security — Policies enforced at platform level
Autonomy — Self-service without DevOps bottleneck

Security Scanning (SAST/DAST/SCA)

Testing Types

Type	What	When	Tools
SAST	Static code analysis	Build time	SonarQube, CodeQL, Semgrep
DAST	Runtime testing	After deployment	OWASP ZAP, Burp Suite
SCA	Dependency vulnerabilities	Build + runtime	Trivy, Snyk, Dependabot
Secret Scanning	Detect leaked credentials	Pre-commit + CI	Gitleaks, TruffleHog
Container Scanning	Image vulnerabilities	Build + registry	Trivy, Clair, Grype

Complete Pipeline Integration

# GitHub Actions security workflow
name: Security Scan
on: [push, pull_request]

jobs:
  sast:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4
    - uses: github/codeql-action/init@v3
      with:
        languages: javascript, python
    - uses: github/codeql-action/analyze@v3

  sca:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4
    - name: Run Trivy SCA
      uses: aquasecurity/trivy-action@master
      with:
        scan-type: 'fs'
        severity: 'CRITICAL,HIGH'

  secrets:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4
      with:
        fetch-depth: 0  # Full history
    - uses: gitleaks/gitleaks-action@v2

  container:
    runs-on: ubuntu-latest
    steps:
    - name: Build image
      run: docker build -t myapp:${{ github.sha }} .
    - name: Scan image
      uses: aquasecurity/trivy-action@master
      with:
        image-ref: myapp:${{ github.sha }}
        severity: 'CRITICAL,HIGH'
        exit-code: 1  # Fail on vulnerabilities

Runtime Security (Falco)

# Detect suspicious container activity
- rule: Shell in Container
  desc: Unexpected shell execution in container
  condition: >
    spawned_process and
    container and
    proc.name in (bash, sh, zsh)
  output: >
    Shell spawned in container
    (user=%user.name container=%container.name
    command=%proc.cmdline)
  priority: WARNING

Metrics and Observability

DORA Metrics (2025 Benchmarks)

Metric	Elite	High	Medium	Low
Deployment Frequency	Multiple/day	Weekly	Monthly	Less than monthly
Lead Time for Changes	< 1 hour	< 1 day	1 week	> 6 months
Mean Time to Recovery	< 1 hour	< 1 day	< 1 week	> 6 months
Change Failure Rate	0-15%	16-30%	31-45%	> 45%

Key Metrics to Track

# Deployment metrics
deployment.frequency: counter
deployment.duration: histogram
deployment.rollback: counter

# Pipeline metrics
pipeline.success_rate: gauge
pipeline.duration: histogram
pipeline.queue_time: histogram

# Feature flag metrics
feature_flag.evaluation: counter
feature_flag.enabled_users: gauge
feature_flag.error_rate: gauge (by flag)

# Resource metrics
pod.cpu_usage: gauge
pod.memory_usage: gauge
pod.restart_count: counter

Checklist

## CI/CD Pipeline
- [ ] Short-lived credentials (OIDC, not static keys)
- [ ] Protected branches for production
- [ ] Parallel jobs for speed
- [ ] Dependency caching configured
- [ ] Build completes in < 10 minutes
- [ ] Security scanning (SAST, SCA, secrets)

## Containers
- [ ] Multi-stage Dockerfile
- [ ] Non-root user (UID > 1000)
- [ ] Minimal base image (alpine/distroless)
- [ ] .dockerignore configured
- [ ] Image scanning in CI
- [ ] Resource limits defined

## Kubernetes
- [ ] Resource requests/limits set
- [ ] Liveness and readiness probes
- [ ] Security context (runAsNonRoot)
- [ ] Network policies defined
- [ ] ConfigMaps/Secrets for config
- [ ] Deployment strategy chosen
- [ ] Image pull policy configured

## Infrastructure as Code
- [ ] Remote state with locking
- [ ] Modular architecture
- [ ] Policy as Code enforcement
- [ ] Automated tests (Terratest/Pulumi tests)
- [ ] Version pinning for providers
- [ ] Environment parity

## Deployments
- [ ] Deployment strategy selected
- [ ] Rollback plan documented
- [ ] Feature flags for large changes
- [ ] Gradual rollout configured
- [ ] Metrics tied to deployments
- [ ] Automated rollback on errors

## Security
- [ ] SAST in pipeline
- [ ] SCA for dependencies
- [ ] Secret scanning enabled
- [ ] Container vulnerability scanning
- [ ] Runtime security monitoring
- [ ] Supply chain security (signed images)

## Observability
- [ ] Deployment frequency tracked
- [ ] Lead time measured
- [ ] MTTR tracked
- [ ] Change failure rate monitored
- [ ] Feature flag metrics
- [ ] Resource utilization dashboards

Install Skill

SKILL.md

DevOps Excellence

Core Principles

Hard Rules (Must Follow)

No Static Credentials

No Root Containers

No Secrets in Images

Protected Production Deployments

Quick Reference

When to Use What

Deployment Strategy Selection

CI/CD Pipeline Best Practices

Pipeline Security

Speed Optimization

Testing Pyramid

Security Scanning Integration

Docker Best Practices

Multi-Stage Builds

Security Hardening

.dockerignore

Kubernetes Deployment Patterns

Resource Management (Right-Sizing)

Health Checks

ConfigMaps and Secrets

Security Best Practices

Infrastructure as Code (Terraform/Pulumi)

Directory Structure

Best Practices

1. Remote State with Locking

2. Modularization

3. Policy as Code

4. Automated Testing

Pulumi Advantages

Release Strategies

Blue-Green Deployment

Canary Deployment

Rolling Update

Feature Flags and Progressive Delivery

Best Practices

1. Flag Lifecycle Management

2. Progressive Rollout

3. Segment Meaningfully

4. Observability Integration

GitOps Practices

Core Principles

ArgoCD Workflow

Repository Structure

Policy Enforcement

Platform Engineering

Internal Developer Portal (Backstage)

Golden Paths (Templates)

Benefits

Security Scanning (SAST/DAST/SCA)

Testing Types

Complete Pipeline Integration

Runtime Security (Falco)

Metrics and Observability

DORA Metrics (2025 Benchmarks)

Key Metrics to Track

Checklist

See Also