Claude Code Plugins

Community-maintained marketplace

DevOps and CI/CD expert. Use when setting up pipelines, containerizing applications, deploying to Kubernetes, or implementing release strategies. Covers GitHub Actions, Docker, K8s, Terraform, and GitOps.

Install Skill

1. Download skill
2. Enable skills in Claude

Open claude.ai/settings/capabilities and find the "Skills" section

3. Upload to Claude

Click "Upload skill" and select the downloaded ZIP file

Note: Review the skill's instructions before using it.

SKILL.md

name: devops-excellence
description: DevOps and CI/CD expert. Use when setting up pipelines, containerizing applications, deploying to Kubernetes, or implementing release strategies. Covers GitHub Actions, Docker, K8s, Terraform, and GitOps.

DevOps Excellence

Core Principles

  • Shift Left — Address security and quality early in SDLC
  • GitOps — Git as single source of truth for infrastructure and deployments
  • Infrastructure as Code — All infrastructure versioned and reproducible
  • Progressive Delivery — Gradual rollouts with feature flags and canary releases
  • Immutable Infrastructure — Replace, don't modify running systems
  • Observability-First — Monitor metrics tied to deployments and features
  • Policy as Code — Enforce compliance and security automatically
  • Platform Engineering — Build golden paths and self-service portals

Hard Rules (Must Follow)

These rules are mandatory. Violating them means the skill is not working correctly.

No Static Credentials

Never use long-lived static credentials. Always use OIDC or short-lived tokens.

# ❌ FORBIDDEN: Static AWS credentials
env:
  AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
  AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}

# ✅ REQUIRED: OIDC-based authentication
# The job (or workflow) must also grant `permissions: id-token: write`
# so it can request a GitHub OIDC token.
- name: Configure AWS Credentials
  uses: aws-actions/configure-aws-credentials@v4
  with:
    role-to-assume: arn:aws:iam::123456789012:role/GitHubActions
    aws-region: us-east-1
    # No long-lived secrets - uses GitHub OIDC provider
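
A rule like this is easier to keep when something checks for it. The sketch below is one way to flag static AWS keys in a workflow file before it is merged. The function name `find_static_credentials` and the list of key names are assumptions made for this example, not part of any GitHub or AWS tooling.

```python
import re

# Names that indicate long-lived AWS keys are being passed to a job.
# This list is an assumption for the sketch, not an exhaustive inventory.
STATIC_KEY_NAMES = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def find_static_credentials(workflow_text: str) -> list[tuple[int, str]]:
    """Return (line_number, key_name) for each non-comment line that sets a static key."""
    findings = []
    for number, line in enumerate(workflow_text.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # Ignore YAML comments
        for name in STATIC_KEY_NAMES:
            # Match the key being assigned, e.g. "AWS_ACCESS_KEY_ID: ..."
            if re.match(rf"{name}\s*:", stripped):
                findings.append((number, name))
    return findings
```

Run against the forbidden snippet above, this reports both `env` entries. The OIDC step produces no findings because it never assigns either key.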

No Root Containers

Containers must NEVER run as root. Always specify a non-root user.

# ❌ FORBIDDEN: Running as root (default)
FROM node:20
WORKDIR /app
CMD ["node", "server.js"]

# ❌ FORBIDDEN: Explicit root user
USER root

# ✅ REQUIRED: Non-root user with UID > 1000
FROM node:20-alpine
RUN addgroup -g 1001 -S nodejs && \
    adduser -S nodejs -u 1001 -G nodejs
USER nodejs
WORKDIR /app
CMD ["node", "server.js"]
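
The same rule can be checked mechanically. The following is a minimal sketch that reads a Dockerfile and reports which user the final stage ends up running as. `effective_user` and `runs_as_root` are hypothetical helpers, and the parser is deliberately simplified: it assumes every `FROM` starts as root, which is not true for base images that set their own non-root user.

```python
def effective_user(dockerfile_text: str) -> str:
    """Return the user the final build stage runs as ("root" if never set)."""
    user = "root"
    for raw in dockerfile_text.splitlines():
        parts = raw.strip().split()
        if not parts or parts[0].startswith("#"):
            continue  # Skip blank lines and comments
        instruction = parts[0].upper()
        if instruction == "FROM":
            # Simplification: treat every new stage as starting from root
            user = "root"
        elif instruction == "USER" and len(parts) > 1:
            # Drop an optional ":group" suffix such as "1001:1001"
            user = parts[1].split(":")[0]
    return user


def runs_as_root(dockerfile_text: str) -> bool:
    """True when the final stage would run as root or UID 0."""
    return effective_user(dockerfile_text) in ("root", "0")
```

A check like this could run in CI next to image scanning. It only looks at the Dockerfile text, so it complements `runAsNonRoot` in Kubernetes and does not replace it.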

No Secrets in Images

Never bake secrets into Docker images. Use runtime injection or secrets managers.

# ❌ FORBIDDEN: Secrets in build args or ENV
ARG DATABASE_PASSWORD
ENV API_KEY=sk-xxx

# ❌ FORBIDDEN: Copying secret files
COPY .env /app/.env
COPY credentials.json /app/

# ✅ REQUIRED: Mount secrets at runtime
# docker run -v /secrets:/app/secrets:ro myapp
# Or inject them from Kubernetes Secrets or an external secrets manager
# (ConfigMaps are for non-sensitive configuration only)
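
On the application side, runtime injection usually means reading the secret from the mounted path at startup. Below is a minimal sketch assuming the `/app/secrets` mount from the `docker run` example above, with one file per secret. The helper name `read_secret` and the one-file-per-secret layout are assumptions for illustration.

```python
from pathlib import Path


def read_secret(name: str, secrets_dir: str = "/app/secrets") -> str:
    """Read one secret from a file mounted at runtime, failing loudly if it is missing."""
    path = Path(secrets_dir) / name
    if not path.is_file():
        raise RuntimeError(f"secret {name!r} is not mounted at {path}")
    # Strip the trailing newline that editors and tools often leave in the file
    return path.read_text(encoding="utf-8").strip()
```

Failing at startup when a secret is missing tends to be easier to debug than falling back to a default value baked into the image.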

Protected Production Deployments

Production deployments must require approval and be restricted to the main branch.

# ❌ FORBIDDEN: Direct production deploy without protection
deploy:
  runs-on: ubuntu-latest
  steps:
    - run: deploy-to-prod.sh

# ✅ REQUIRED: Environment protection
deploy:
  runs-on: ubuntu-latest
  environment:
    name: production
    url: https://myapp.com
  # Required reviewers and the main-only branch rule are configured on the
  # "production" environment in the repository settings

Quick Reference

When to Use What

| Scenario | Tool/Pattern | Reason |
| --- | --- | --- |
| Public GitHub project | GitHub Actions | Native integration, free for public repos |
| Enterprise GitLab | GitLab CI | Unified platform, advanced security scanning |
| Multi-cloud IaC | Terraform | Mature ecosystem, wide provider support |
| Developer-centric IaC | Pulumi | Real programming languages, better testing |
| Kubernetes deployments | ArgoCD + Kustomize | GitOps standard, declarative config |
| Zero-downtime releases | Blue-Green or Canary | Instant rollback capability |
| Gradual feature rollout | Feature flags (LaunchDarkly) | Progressive delivery with targeting |

Deployment Strategy Selection

| Strategy | Downtime | Cost | Rollback Speed | Complexity | Best For |
| --- | --- | --- | --- | --- | --- |
| Rolling | Minimal | Low | Medium | Low | Regular updates, cost-conscious |
| Blue-Green | Zero | High (2x) | Instant | Medium | Critical systems, easy rollback |
| Canary | Zero | Medium | Fast | High | Risk mitigation, data-driven |
| Recreate | High | Low | N/A | Very Low | Non-critical, dev/test only |

CI/CD Pipeline Best Practices

Pipeline Security

# Short-lived credentials (not static keys)
- name: Configure AWS Credentials
  uses: aws-actions/configure-aws-credentials@v4
  with:
    role-to-assume: arn:aws:iam::123456789012:role/GitHubActions
    aws-region: us-east-1
    # OIDC provider - no long-lived secrets!

# Protected environments for production
environment:
  name: production
  # Approval and the main-branch restriction are enforced by the
  # environment's protection rules in the repository settings

Speed Optimization

  • 10-minute build rule — Most projects should build in <10 minutes
  • Parallel jobs — Run tests, linting, security scans concurrently
  • Cache dependencies — Cache node_modules, .m2, pip packages
  • Conditional execution — Skip jobs when files haven't changed

# Example: conditional execution with a path filter
# (the workflow only triggers when files under backend/ change)
on:
  push:
    paths:
      - 'backend/**'

jobs:
  backend-tests:
    runs-on: ubuntu-latest
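
The idea behind conditional execution can be sketched independently of any CI system: given the files a change touched, decide whether a job is relevant. The helper `should_run` below is hypothetical. It uses Python's `fnmatch`, whose `*` also matches `/`, so its patterns behave a little differently from GitHub's path filters.

```python
from fnmatch import fnmatch


def should_run(changed_files: list[str], patterns: list[str]) -> bool:
    """Return True when at least one changed file matches at least one pattern."""
    return any(
        fnmatch(path, pattern)
        for path in changed_files
        for pattern in patterns
    )
```

For example, `should_run(["backend/api/app.py", "README.md"], ["backend/*"])` is true, while a change that only touches `frontend/` files would skip the backend job.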

Testing Pyramid

              /\
             /E2E\        <- Few (slow, expensive)
            /------\
           /Integration\ <- Some (medium speed)
          /------------\
         /  Unit Tests  \ <- Many (fast, cheap)
        /----------------\

  • Roughly 70% unit tests (fast, isolated)
  • Roughly 20% integration tests (service interactions)
  • Roughly 10% E2E tests (full user workflows)

Security Scanning Integration

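# Multi-layer security scanning
jobs:
  security:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      security-events: write  # Needed by CodeQL to upload results
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # Full history for secret scanning

      # SAST - Static code analysis
      - uses: github/codeql-action/init@v3
      - uses: github/codeql-action/analyze@v3

      # SCA - Dependency vulnerabilities
      - name: Run Trivy
        uses: aquasecurity/trivy-action@0.28.0  # Pin a release tag or commit SHA, not a branch
        with:
          scan-type: 'fs'
          format: 'sarif'
          output: 'trivy-results.sarif'

      # Secret scanning
      - name: Gitleaks
        uses: gitleaks/gitleaks-action@v2
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

      # Container scanning (assumes the image was built earlier in this job)
      - name: Scan Docker image
        uses: aquasecurity/trivy-action@0.28.0
        with:
          image-ref: myapp:${{ github.sha }}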

Docker Best Practices

Multi-Stage Builds

# Build stage - installs dependencies; discarded after the build
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev

# Runtime stage - minimal image with only what the app needs to run
FROM node:20-alpine AS runtime
RUN addgroup -g 1001 -S nodejs && \
    adduser -S nodejs -u 1001 -G nodejs
WORKDIR /app
COPY --from=builder --chown=nodejs:nodejs /app/node_modules ./node_modules
COPY --chown=nodejs:nodejs . .
USER nodejs
EXPOSE 3000
CMD ["node", "server.js"]

Security Hardening

  • Non-root user — ALWAYS run as non-root (UID 1001)
  • Minimal base images — Use alpine, distroless, or scratch
  • Read-only filesystem: docker run --read-only
  • No secrets in layers — Use build secrets or external vaults
  • Resource limits — Set CPU/memory limits to prevent DoS
  • Signed images — Enable Docker Content Trust
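
# Security best practices example (distroless runtime, non-root UID 65532)
FROM gcr.io/distroless/nodejs20-debian12
WORKDIR /app
COPY --chown=65532:65532 . /app
USER 65532
EXPOSE 8080
# The distroless Node.js image already uses node as its entrypoint
CMD ["server.js"]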

.dockerignore

# Version control
.git
.gitignore

# Dependencies (install fresh in container)
node_modules
vendor/
*.pyc
__pycache__

# Secrets and configs
.env
.env.local
secrets/
*.key
*.pem

# Development files
README.md
Dockerfile
docker-compose.yml
.vscode/
.idea/

# Testing and CI
tests/
*.test.js
.github/

Kubernetes Deployment Patterns

Resource Management (Right-Sizing)

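# Commonly cited figures (source not given here): 99.94% of clusters are
# over-provisioned, with average CPU usage of 10% and memory usage of 23%
resources:
  requests:
    memory: "128Mi"  # Reserved for the pod at scheduling time
    cpu: "100m"      # 0.1 CPU cores
  limits:
    memory: "256Mi"  # Container is OOM-killed above this
    cpu: "200m"      # Container is throttled above this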

# Use tools: Kubecost, Goldilocks, VPA
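
To make right-sizing concrete, here is a minimal sketch of how a request value could be derived from observed usage: take a high percentile of the samples and add some headroom. The function name, the nearest-rank percentile, and the default 95th percentile with 15% headroom are all assumptions for illustration. Tools such as VPA and Goldilocks use their own, more elaborate models.

```python
import math


def recommend_request(samples: list[float], percentile: float = 0.95,
                      headroom: float = 0.15) -> float:
    """Suggest a resource request from usage samples (same unit in, same unit out)."""
    if not samples:
        raise ValueError("at least one usage sample is required")
    ordered = sorted(samples)
    # Nearest-rank percentile: the smallest sample covering the requested share
    rank = max(1, math.ceil(percentile * len(ordered)))
    return ordered[rank - 1] * (1 + headroom)
```

Feeding it memory samples in MiB gives a memory request in MiB. Feeding it CPU samples in millicores gives a CPU request in millicores.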

Health Checks

# Liveness: Is container alive?
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3

# Readiness: Can it receive traffic?
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  successThreshold: 1

# Startup: Has initialization completed?
startupProbe:
  httpGet:
    path: /startup
    port: 8080
  failureThreshold: 30  # 30*10s = 5min for slow starts
  periodSeconds: 10
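
The timing comments in these probes follow from simple arithmetic: a probe gives up after roughly `failureThreshold * periodSeconds`, plus any initial delay. The helper below is a hypothetical name for that calculation. It ignores `timeoutSeconds` and scheduling jitter, so treat the results as approximations.

```python
def probe_window_seconds(failure_threshold: int, period_seconds: int,
                         initial_delay_seconds: int = 0) -> int:
    """Approximate time before a probe's failure threshold is reached."""
    return initial_delay_seconds + failure_threshold * period_seconds


# Values copied from the startup and liveness probes above
startup_budget = probe_window_seconds(failure_threshold=30, period_seconds=10)
liveness_window = probe_window_seconds(failure_threshold=3, period_seconds=10,
                                       initial_delay_seconds=30)
print(startup_budget)   # 300 seconds, i.e. 5 minutes
print(liveness_window)  # 60 seconds
```

If an application can take longer than the startup budget to initialize, raise `failureThreshold` on the startup probe rather than loosening the liveness probe.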

ConfigMaps and Secrets

# Group related resources in single manifest
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  APP_ENV: production
  LOG_LEVEL: info
---
apiVersion: v1
kind: Secret
metadata:
  name: app-secrets
type: Opaque
stringData:
  # Placeholder only - never commit real values to Git; supply them through
  # Sealed Secrets, External Secrets, or a similar mechanism
  DATABASE_URL: postgresql://user:pass@db:5432/mydb
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: app
        image: myapp:v1.0  # Example tag
        envFrom:
        - configMapRef:
            name: app-config
        - secretRef:
            name: app-secrets

Security Best Practices

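# Pod Security Standards
# Pod-level settings (spec.securityContext)
securityContext:
  runAsNonRoot: true
  runAsUser: 1001
  fsGroup: 1001
  seccompProfile:
    type: RuntimeDefault

# Container-level settings (spec.containers[].securityContext)
securityContext:
  allowPrivilegeEscalation: false
  capabilities:
    drop:
    - ALL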

# Network Policies (deny-by-default)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all-ingress
spec:
  podSelector: {}
  policyTypes:
  - Ingress

Infrastructure as Code (Terraform/Pulumi)

Directory Structure

terraform/
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   └── terraform.tfvars
│   ├── staging/
│   └── prod/
├── modules/
│   ├── vpc/
│   ├── eks/
│   └── rds/
├── backend.tf        # Remote state config
└── versions.tf       # Provider versions

Best Practices

1. Remote State with Locking

# backend.tf
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "prod/vpc/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"  # Prevents concurrent runs; newer Terraform releases also offer use_lockfile = true
  }
}

2. Modularization

# modules/vpc/main.tf
variable "cidr_block" {
  type        = string
  description = "VPC CIDR block"
}

variable "environment" {
  type        = string
  description = "Environment name used in resource tags"
}

resource "aws_vpc" "main" {
  cidr_block           = var.cidr_block
  enable_dns_hostnames = true
  tags = {
    Name = "${var.environment}-vpc"
  }
}

# environments/prod/main.tf
module "vpc" {
  source      = "../../modules/vpc"
  cidr_block  = "10.0.0.0/16"
  environment = "prod"
}

3. Policy as Code

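# Use Sentinel (Terraform Cloud) or OPA
# sentinel.hcl - registers the policy and its enforcement level
policy "enforce-tags" {
  source            = "./enforce-tags.sentinel"
  enforcement_level = "hard-mandatory"
}

# enforce-tags.sentinel holds the rule itself (pseudocode):
#   every managed resource in the plan must carry an "Owner" tag
#   on failure: "All resources must have Owner tag"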

4. Automated Testing

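// Terratest example
package test

import (
  "testing"

  "github.com/gruntwork-io/terratest/modules/terraform"
  "github.com/stretchr/testify/assert"
)

func TestVPCCreation(t *testing.T) {
  terraformOptions := &terraform.Options{
    TerraformDir: "../environments/dev",
  }

  // Tear down whatever was created, even if an assertion fails
  defer terraform.Destroy(t, terraformOptions)
  terraform.InitAndApply(t, terraformOptions)

  // Assumes the configuration defines a "vpc_id" output
  vpcId := terraform.Output(t, terraformOptions, "vpc_id")
  assert.NotEmpty(t, vpcId)
}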

Pulumi Advantages

// Pulumi - real programming language benefits
import * as aws from "@pulumi/aws";
import * as pulumi from "@pulumi/pulumi";

const environments = ["dev", "staging", "prod"];

// Use loops, conditionals, functions
environments.forEach(env => {
  new aws.ec2.Vpc(`${env}-vpc`, {
    cidrBlock: env === "prod" ? "10.0.0.0/16" : "10.1.0.0/16",
    tags: { Environment: env },
  });
});

// Unit testing with mocks, run under an ordinary test runner.
// setMocks takes an object with newResource and call handlers (elided here).
pulumi.runtime.setMocks(...);

Release Strategies

Blue-Green Deployment

# Two identical environments
# Switch traffic instantly via load balancer

# Step 1: Deploy to Green (idle)
# Step 2: Test Green environment
# Step 3: Switch LB from Blue to Green
# Step 4: Keep Blue as rollback option

# Kubernetes example
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
    version: blue  # Change to 'green' to switch
  ports:
  - port: 80

When to use:

  • Critical systems requiring instant rollback
  • Compliance requirements for zero downtime
  • Budget allows 2x infrastructure

Canary Deployment

# Gradual rollout: 10% → 50% → 100%
# Monitor metrics at each stage

# Argo Rollouts example
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: myapp
spec:
  replicas: 10
  selector:
    matchLabels:
      app: myapp
  strategy:
    canary:
      steps:
      - setWeight: 10      # 1 pod (10%)
      - pause: {duration: 5m}
      - setWeight: 50      # 5 pods
      - pause: {duration: 10m}
      - setWeight: 100     # All pods
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: myapp
        image: myapp:v2.0
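
"Monitor metrics at each stage" implies a decision rule at every pause. A minimal sketch of such a gate follows: it compares the canary's error rate with the stable baseline and either promotes or rolls back. The function name, the 1.5x ratio, and the 0.1% floor are assumptions for illustration, not Argo Rollouts behavior. In Argo Rollouts this role is normally played by an analysis step.

```python
def canary_decision(baseline_error_rate: float, canary_error_rate: float,
                    max_ratio: float = 1.5, floor: float = 0.001) -> str:
    """Return "promote" or "rollback" for one canary stage.

    Rates are fractions (0.01 means 1% of requests failed).
    """
    # Allow the canary to be somewhat worse than baseline, and never demand
    # a rate below the floor (avoids rolling back on noise when baseline is 0)
    allowed = max(baseline_error_rate * max_ratio, floor)
    return "promote" if canary_error_rate <= allowed else "rollback"
```

With a 1% baseline, a canary at 1.2% would be promoted to the next weight, while one at 5% would be rolled back.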

When to use:

  • High-risk deployments (major refactors)
  • User-facing features needing validation
  • Data-driven rollout decisions

Rolling Update

# Default Kubernetes strategy
# Gradually replace old pods with new

apiVersion: apps/v1
kind: Deployment
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1  # Never < 9 pods available
      maxSurge: 2        # Never > 12 pods total
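
The bounds in those comments can be worked out directly. The sketch below computes the minimum available and maximum total pod counts for a rolling update, including percentage values, where Kubernetes rounds `maxUnavailable` down and `maxSurge` up. The function names are hypothetical.

```python
import math


def resolve(value, replicas: int, round_up: bool) -> int:
    """Resolve an absolute count or a percentage string like "25%" to a pod count."""
    if isinstance(value, str) and value.endswith("%"):
        exact = replicas * int(value[:-1]) / 100
        return math.ceil(exact) if round_up else math.floor(exact)
    return int(value)


def rollout_bounds(replicas: int, max_unavailable, max_surge) -> tuple[int, int]:
    """Return (minimum available pods, maximum total pods) during the rollout."""
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    surge = resolve(max_surge, replicas, round_up=True)
    return replicas - unavailable, replicas + surge


print(rollout_bounds(10, 1, 2))          # (9, 12), matching the comments above
print(rollout_bounds(10, "25%", "25%"))  # (8, 13)
```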

When to use:

  • Regular incremental updates
  • Cost-conscious deployments
  • Low-risk changes

Feature Flags and Progressive Delivery

Best Practices

1. Flag Lifecycle Management

// Avoid "flag debt" - remove after rollout
const featureFlags = {
  // Short-lived (remove after 100% rollout)
  "new-checkout-v4": {
    enabled: true,
    rollout: 100,
    created: "2025-01-15",
    removeBy: "2025-02-15"
  },

  // Long-lived (kill switch)
  "payment-processing": {
    enabled: true,
    permanent: true,  // Document why
    reason: "Emergency shutoff for payment issues"
  }
};
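
Flag debt is easier to control when the `removeBy` dates are checked automatically. Below is a minimal sketch, assuming the flag registry is available as a dictionary with the same fields as the object above. `overdue_flags` is a hypothetical helper that could run in CI and fail the build or open a ticket.

```python
from datetime import date


def overdue_flags(flags: dict, today: date) -> list[str]:
    """Return the names of non-permanent flags whose removeBy date has passed."""
    overdue = []
    for name, config in flags.items():
        if config.get("permanent"):
            continue  # Kill switches are expected to stay
        remove_by = config.get("removeBy")
        if remove_by and date.fromisoformat(remove_by) < today:
            overdue.append(name)
    return sorted(overdue)


flags = {
    "new-checkout-v4": {"enabled": True, "rollout": 100,
                        "created": "2025-01-15", "removeBy": "2025-02-15"},
    "payment-processing": {"enabled": True, "permanent": True},
}
print(overdue_flags(flags, date(2025, 3, 1)))  # ['new-checkout-v4']
```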

2. Progressive Rollout

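// LaunchDarkly example (server-side Node.js SDK, where variation() returns a Promise)
const showNewFeature = await ldClient.variation(
  "new-dashboard-ui",
  user,
  false  // Default fallback
);

// Targeting configuration (illustrative shape, not the exact LaunchDarkly schema)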
{
  "targeting": {
    "rules": [
      {
        "variation": "on",
        "clauses": [
          {
            "attribute": "email",
            "op": "endsWith",
            "values": ["@mycompany.com"]
          }
        ]
      }
    ],
    "rollout": {
      "percentage": 10  // 10% of remaining users
    }
  }
}
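
Percentage rollouts generally rely on deterministic bucketing, so that a given user sees the same variation on every request. The sketch below shows the general technique with a hash of the flag key and user key. It is not LaunchDarkly's actual algorithm, and the function names are hypothetical.

```python
import hashlib


def bucket(flag_key: str, user_key: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(f"{flag_key}:{user_key}".encode("utf-8")).hexdigest()
    # Use the first 32 bits of the digest as an integer
    return int(digest[:8], 16) % 100


def in_rollout(flag_key: str, user_key: str, percentage: int) -> bool:
    """True when the user falls inside the first `percentage` buckets."""
    return bucket(flag_key, user_key) < percentage
```

Because the bucket depends only on the two keys, raising the percentage from 10 to 50 keeps every user who already had the feature and only adds new ones. Including the flag key in the hash keeps different flags from always targeting the same users.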

3. Segment Meaningfully

  • Geographic: Region-specific rollouts
  • Behavioral: Power users first, then general
  • Technical: Browser/device-based targeting
  • Business: Premium tier vs free tier

4. Observability Integration

// Tie metrics to feature flags
metrics.increment('checkout.completed', {
  feature_flag: 'new-checkout-v4',
  enabled: showNewCheckout
});

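// Automatic rollback on error spike
// LaunchDarkly SDK clients only evaluate flags. Turning a flag off goes through the
// LaunchDarkly REST API or a flag trigger; disableFlag is a hypothetical wrapper for that call.
if (errorRate > threshold) {
  await disableFlag('new-checkout-v4');
  alerts.critical('Auto-rollback triggered for new-checkout-v4');
}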

GitOps Practices

Core Principles

  1. Declarative — Entire system state in Git
  2. Versioned — Git history = audit trail
  3. Immutable — Git commits are immutable
  4. Automatic — Agents auto-sync cluster to Git state
  5. Continuous — Reconciliation loop detects drift
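
The reconciliation loop in principles 4 and 5 boils down to comparing the desired state from Git with the live state and computing what to change. Here is a minimal sketch, assuming both states are dictionaries keyed by resource name. `plan_sync` is a hypothetical function, and real agents such as ArgoCD do considerably more (ordering, hooks, health checks).

```python
def plan_sync(desired: dict, live: dict, prune: bool = True) -> dict:
    """Compute the actions needed to make the live state match the desired state."""
    create = sorted(name for name in desired if name not in live)
    update = sorted(name for name in desired
                    if name in live and desired[name] != live[name])
    # Pruning removes resources that exist in the cluster but not in Git
    delete = sorted(name for name in live if name not in desired) if prune else []
    return {"create": create, "update": update, "delete": delete}


desired = {"deployment/myapp": {"replicas": 3}, "service/myapp": {"port": 80}}
live = {"deployment/myapp": {"replicas": 5}, "configmap/old": {"k": "v"}}
print(plan_sync(desired, live))
# {'create': ['service/myapp'], 'update': ['deployment/myapp'], 'delete': ['configmap/old']}
```

The `update` entry is the drift case: someone scaled the deployment by hand, and the loop brings it back to what Git declares.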

ArgoCD Workflow

# Application definition
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/myorg/k8s-manifests
    targetRevision: main
    path: apps/myapp/overlays/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true      # Delete resources not in Git
      selfHeal: true   # Auto-sync on drift detection
    syncOptions:
    - CreateNamespace=true

Repository Structure

k8s-manifests/
├── apps/
│   ├── myapp/
│   │   ├── base/
│   │   │   ├── deployment.yaml
│   │   │   ├── service.yaml
│   │   │   └── kustomization.yaml
│   │   └── overlays/
│   │       ├── dev/
│   │       ├── staging/
│   │       └── prod/
│   │           ├── kustomization.yaml
│   │           └── replicas-patch.yaml
├── infrastructure/
│   ├── ingress-nginx/
│   └── cert-manager/
└── argocd/
    ├── projects.yaml
    └── applications.yaml

Policy Enforcement

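# OPA Gatekeeper - require owner and environment labels on Deployments
# (assumes the K8sRequiredLabels ConstraintTemplate is already installed)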
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: require-owner-label
spec:
  match:
    kinds:
    - apiGroups: ["apps"]
      kinds: ["Deployment"]
  parameters:
    labels: ["owner", "environment"]

Platform Engineering

Internal Developer Portal (Backstage)

# Software catalog
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: order-service
  description: Order processing microservice
  tags:
    - java
    - spring-boot
  annotations:
    github.com/project-slug: myorg/order-service
    pagerduty.com/integration-key: xyz
spec:
  type: service
  lifecycle: production
  owner: team-orders
  system: ecommerce-platform

Golden Paths (Templates)

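# Self-service project scaffolding
# Action IDs depend on which scaffolder modules are installed;
# treat setup-pipeline and provision-k8s as placeholders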
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: nodejs-service
  title: Node.js Microservice
spec:
  steps:
  - id: fetch-template
    action: fetch:template
    input:
      url: ./skeleton
  - id: create-repo
    action: github:repo:create
  - id: setup-pipeline
    action: github:actions:create
  - id: provision-k8s
    action: argocd:create-app

Benefits

  • Setup time — Days to minutes (40% reduction in tickets)
  • Consistency — Standardized patterns across teams
  • Security — Policies enforced at platform level
  • Autonomy — Self-service without DevOps bottleneck

Security Scanning (SAST/DAST/SCA)

Testing Types

| Type | What | When | Tools |
| --- | --- | --- | --- |
| SAST | Static code analysis | Build time | SonarQube, CodeQL, Semgrep |
| DAST | Runtime testing | After deployment | OWASP ZAP, Burp Suite |
| SCA | Dependency vulnerabilities | Build + runtime | Trivy, Snyk, Dependabot |
| Secret Scanning | Detect leaked credentials | Pre-commit + CI | Gitleaks, TruffleHog |
| Container Scanning | Image vulnerabilities | Build + registry | Trivy, Clair, Grype |

Complete Pipeline Integration

# GitHub Actions security workflow
name: Security Scan
on: [push, pull_request]

jobs:
  sast:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4
    - uses: github/codeql-action/init@v3
      with:
        languages: javascript, python
    - uses: github/codeql-action/analyze@v3

  sca:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4
    - name: Run Trivy SCA
      uses: aquasecurity/trivy-action@0.28.0  # Pin a release tag or commit SHA, not a branch
      with:
        scan-type: 'fs'
        severity: 'CRITICAL,HIGH'

  secrets:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4
      with:
        fetch-depth: 0  # Full history
    - uses: gitleaks/gitleaks-action@v2

  container:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4  # The build needs the Dockerfile and sources
    - name: Build image
      run: docker build -t myapp:${{ github.sha }} .
    - name: Scan image
      uses: aquasecurity/trivy-action@0.28.0  # Pinned release tag instead of a moving branch
      with:
        image-ref: myapp:${{ github.sha }}
        severity: 'CRITICAL,HIGH'
        exit-code: 1  # Fail on vulnerabilities
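
When a scanner's exit code is not enough, for example to list the blocking findings in a pull request comment, the JSON report can be filtered directly. The sketch below assumes the shape Trivy emits with `--format json`: a top-level `Results` list whose entries may carry a `Vulnerabilities` list. The function name is hypothetical.

```python
import json


def blocking_findings(report_json: str,
                      severities: tuple[str, ...] = ("CRITICAL", "HIGH")) -> list[str]:
    """Return the IDs of vulnerabilities whose severity should fail the build."""
    report = json.loads(report_json)
    ids = []
    # Both keys may be missing or null when a target has no findings
    for result in report.get("Results") or []:
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") in severities:
                ids.append(vuln.get("VulnerabilityID", "unknown"))
    return ids
```

A pipeline step could fail when the returned list is non-empty, which mirrors the `severity` and `exit-code` settings in the workflow above.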

Runtime Security (Falco)

# Detect suspicious container activity
- rule: Shell in Container
  desc: Unexpected shell execution in container
  condition: >
    spawned_process and
    container and
    proc.name in (bash, sh, zsh)
  output: >
    Shell spawned in container
    (user=%user.name container=%container.name
    command=%proc.cmdline)
  priority: WARNING

Metrics and Observability

DORA Metrics (2025 Benchmarks)

| Metric | Elite | High | Medium | Low |
| --- | --- | --- | --- | --- |
| Deployment Frequency | Multiple/day | Weekly | Monthly | Less than monthly |
| Lead Time for Changes | < 1 hour | < 1 day | 1 week | > 6 months |
| Mean Time to Recovery | < 1 hour | < 1 day | < 1 week | > 6 months |
| Change Failure Rate | 0-15% | 16-30% | 31-45% | > 45% |
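
As a worked example of one of these metrics, the sketch below computes change failure rate from deployment counts and maps it to the tiers in the Change Failure Rate row. The function names are hypothetical. Values that fall between two bands, such as 15.5%, are assigned to the lower tier here, which is an assumption the table does not settle.

```python
def change_failure_rate(deployments: int, failed: int) -> float:
    """Percentage of deployments that caused a failure in production."""
    if deployments <= 0:
        raise ValueError("deployments must be positive")
    return 100.0 * failed / deployments


def cfr_tier(rate_percent: float) -> str:
    """Map a change failure rate to the tier names used in the table."""
    if rate_percent <= 15:
        return "Elite"
    if rate_percent <= 30:
        return "High"
    if rate_percent <= 45:
        return "Medium"
    return "Low"


rate = change_failure_rate(deployments=200, failed=10)
print(rate, cfr_tier(rate))  # 5.0 Elite
```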

Key Metrics to Track

# Deployment metrics
deployment.frequency: counter
deployment.duration: histogram
deployment.rollback: counter

# Pipeline metrics
pipeline.success_rate: gauge
pipeline.duration: histogram
pipeline.queue_time: histogram

# Feature flag metrics
feature_flag.evaluation: counter
feature_flag.enabled_users: gauge
feature_flag.error_rate: gauge (by flag)

# Resource metrics
pod.cpu_usage: gauge
pod.memory_usage: gauge
pod.restart_count: counter

Checklist

## CI/CD Pipeline
- [ ] Short-lived credentials (OIDC, not static keys)
- [ ] Protected branches for production
- [ ] Parallel jobs for speed
- [ ] Dependency caching configured
- [ ] Build completes in < 10 minutes
- [ ] Security scanning (SAST, SCA, secrets)

## Containers
- [ ] Multi-stage Dockerfile
- [ ] Non-root user (UID > 1000)
- [ ] Minimal base image (alpine/distroless)
- [ ] .dockerignore configured
- [ ] Image scanning in CI
- [ ] Resource limits defined

## Kubernetes
- [ ] Resource requests/limits set
- [ ] Liveness and readiness probes
- [ ] Security context (runAsNonRoot)
- [ ] Network policies defined
- [ ] ConfigMaps/Secrets for config
- [ ] Deployment strategy chosen
- [ ] Image pull policy configured

## Infrastructure as Code
- [ ] Remote state with locking
- [ ] Modular architecture
- [ ] Policy as Code enforcement
- [ ] Automated tests (Terratest/Pulumi tests)
- [ ] Version pinning for providers
- [ ] Environment parity

## Deployments
- [ ] Deployment strategy selected
- [ ] Rollback plan documented
- [ ] Feature flags for large changes
- [ ] Gradual rollout configured
- [ ] Metrics tied to deployments
- [ ] Automated rollback on errors

## Security
- [ ] SAST in pipeline
- [ ] SCA for dependencies
- [ ] Secret scanning enabled
- [ ] Container vulnerability scanning
- [ ] Runtime security monitoring
- [ ] Supply chain security (signed images)

## Observability
- [ ] Deployment frequency tracked
- [ ] Lead time measured
- [ ] MTTR tracked
- [ ] Change failure rate monitored
- [ ] Feature flag metrics
- [ ] Resource utilization dashboards

See Also