| title | devops-flow: DevOps, MLOps, DevSecOps practices for cloud environments |
| name | devops-flow |
| description | DevOps, MLOps, DevSecOps practices for cloud environments (GCP, Azure, AWS) |
| tags | sdd-workflow, shared-architecture, domain-specific |
devops-flow
Description: Infrastructure as Code, CI/CD pipeline automation, and deployment management
Category: DevOps & Deployment
Complexity: High (multi-cloud + orchestration + automation)
Purpose
Automate infrastructure provisioning, CI/CD pipelines, and deployment processes based on SPEC documents and ADR decisions. This ensures consistent, repeatable, and secure deployments across environments.
Capabilities
1. Infrastructure as Code (IaC)
- Terraform: Cloud-agnostic infrastructure provisioning
- CloudFormation: AWS-native infrastructure
- Ansible: Configuration management and provisioning
- Pulumi: Modern IaC with standard programming languages
- Kubernetes manifests: Container orchestration
2. CI/CD Pipeline Generation
- GitHub Actions: Workflow automation
- GitLab CI: Pipeline configuration
- Jenkins: Pipeline as code
- CircleCI: Cloud-native CI/CD
- Azure DevOps: Microsoft ecosystem integration
3. Container Configuration
- Dockerfile: Container image definition
- Docker Compose: Multi-container applications
- Kubernetes: Production orchestration
- Helm charts: Kubernetes package management
- Container registry: Image storage and versioning
4. Deployment Strategies
- Blue-Green: Zero-downtime deployments
- Canary: Gradual rollout with monitoring
- Rolling: Sequential instance updates
- Feature flags: Progressive feature enablement
- Rollback procedures: Automated failure recovery
5. Environment Management
- Environment separation: dev, staging, production
- Configuration management: Environment-specific configs
- Secret management: Vault, AWS Secrets Manager, etc.
- Infrastructure versioning: State management
- Cost optimization: Resource tagging and monitoring
6. Monitoring & Observability
- Logging: Centralized log aggregation
- Metrics: Performance and health monitoring
- Alerting: Incident response automation
- Tracing: Distributed request tracking
- Dashboards: Real-time visualization
7. Security & Compliance
- Security scanning: Container and infrastructure
- Compliance checks: Policy enforcement
- Access control: IAM and RBAC
- Network security: Firewall rules, VPC configuration
- Audit logging: Change tracking
8. Disaster Recovery
- Backup automation: Data and configuration backups (see the Terraform sketch after this list)
- Recovery procedures: Automated restoration
- Failover: Multi-region redundancy
- Data replication: Cross-region sync
- RTO/RPO: Recovery objectives implementation
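Backup automation is not covered by the infrastructure templates later in this document. A minimal Terraform sketch using AWS Backup, assuming the RDS instance defined in the AWS infrastructure template below; the vault name, schedule, retention, and the aws_iam_role.backup role are illustrative assumptions:
# backup.tf (sketch only; names, schedule, and IAM role are assumptions)
resource "aws_backup_vault" "main" {
  name = "${var.project_name}-${var.environment}-backup-vault"
}

resource "aws_backup_plan" "daily" {
  name = "${var.project_name}-${var.environment}-daily"

  rule {
    rule_name         = "daily"
    target_vault_name = aws_backup_vault.main.name
    schedule          = "cron(0 3 * * ? *)" # 03:00 UTC daily

    lifecycle {
      delete_after = 30 # retention in days; tune to the RPO in the SPEC
    }
  }
}

resource "aws_backup_selection" "db" {
  iam_role_arn = aws_iam_role.backup.arn # assumes a role with AWSBackupServiceRolePolicyForBackup attached
  name         = "${var.project_name}-db"
  plan_id      = aws_backup_plan.daily.id
  resources    = [aws_db_instance.main.arn]
}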
DevOps Workflow
graph TD
A[SPEC Document] --> B[Extract Requirements]
B --> C{Infrastructure Needed?}
C -->|Yes| D[Generate IaC Templates]
C -->|No| E[Generate CI/CD Pipeline]
D --> F[Terraform/CloudFormation]
F --> G[Validate Infrastructure Code]
G --> H{Validation Pass?}
H -->|No| I[Report Issues]
H -->|Yes| J[Generate Deployment Pipeline]
E --> J
J --> K[CI/CD Configuration]
K --> L[Add Build Stage]
L --> M[Add Test Stage]
M --> N[Add Security Scan]
N --> O[Add Deploy Stage]
O --> P[Environment Strategy]
P --> Q{Deployment Type}
Q -->|Blue-Green| R[Generate Blue-Green Config]
Q -->|Canary| S[Generate Canary Config]
Q -->|Rolling| T[Generate Rolling Config]
R --> U[Add Monitoring]
S --> U
T --> U
U --> V[Add Rollback Procedure]
V --> W[Generate Documentation]
W --> X[Review & Deploy]
I --> X
Usage Instructions
Generate Infrastructure from SPEC
devops-flow generate-infra \
--spec specs/SPEC-API-V1.md \
--cloud aws \
--output infrastructure/
Generated Terraform structure:
infrastructure/
├── main.tf # Main configuration
├── variables.tf # Input variables
├── outputs.tf # Output values
├── providers.tf # Cloud provider config
├── modules/
│ ├── vpc/ # Network infrastructure
│ ├── compute/ # EC2, Lambda, etc.
│ ├── database/ # RDS, DynamoDB
│ └── storage/ # S3, EBS
└── environments/
├── dev.tfvars # Development config
├── staging.tfvars # Staging config
└── prod.tfvars # Production config
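An illustrative environments/dev.tfvars is sketched below. Variable names follow the AWS template later in this document, but every value is an assumption to adjust per project:
# environments/dev.tfvars (illustrative values only)
aws_region               = "us-east-1"
project_name             = "myapp"
environment              = "dev"
vpc_cidr                 = "10.0.0.0/16"
availability_zones       = ["us-east-1a", "us-east-1b"]
public_subnet_cidrs      = ["10.0.1.0/24", "10.0.2.0/24"]
private_subnet_cidrs     = ["10.0.11.0/24", "10.0.12.0/24"]
db_instance_class        = "db.t3.micro"
db_allocated_storage     = 20
db_max_allocated_storage = 100
db_name                  = "appdb"
db_username              = "app"
Apply per environment with, for example, terraform plan -var-file=environments/dev.tfvars.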
Generate CI/CD Pipeline
devops-flow generate-pipeline \
--type github-actions \
--language python \
--deploy-strategy blue-green \
--output .github/workflows/
Generated GitHub Actions workflow:
name: CI/CD Pipeline
on:
push:
branches: [main, develop]
pull_request:
branches: [main]
env:
PYTHON_VERSION: '3.11'
AWS_REGION: us-east-1
jobs:
lint:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: ${{ env.PYTHON_VERSION }}
- name: Install dependencies
run: |
pip install -r requirements.txt ruff mypy
- name: Run linters
run: |
ruff check .
mypy .
test:
needs: lint
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Install dependencies
  run: |
    pip install -r requirements.txt pytest pytest-cov
- name: Run tests
  run: |
    pytest --cov=. --cov-report=xml
- name: Upload coverage
uses: codecov/codecov-action@v3
security:
needs: test
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Security scan
  run: |
    pip install bandit
    bandit -r . -f json -o security-report.json
- name: Upload security report
uses: actions/upload-artifact@v3
with:
name: security-report
path: security-report.json
build:
needs: [lint, test, security]
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Build Docker image
run: |
docker build -t registry.example.com/app:${{ github.sha }} .
- name: Push to registry
run: |
docker push registry.example.com/app:${{ github.sha }}
deploy-staging:
needs: build
if: github.ref == 'refs/heads/develop'
runs-on: ubuntu-latest
environment: staging
steps:
- name: Deploy to staging
run: |
aws ecs update-service \
--cluster staging-cluster \
--service app-service \
--force-new-deployment
deploy-production:
needs: build
if: github.ref == 'refs/heads/main'
runs-on: ubuntu-latest
environment: production
steps:
- name: Deploy blue-green
run: |
# Deploy to green environment
./scripts/deploy-green.sh
# Run smoke tests
./scripts/smoke-tests.sh
# Switch traffic to green
./scripts/switch-traffic.sh
# Keep blue for rollback
Generate Kubernetes Configuration
devops-flow generate-k8s \
--spec specs/SPEC-API-V1.md \
--replicas 3 \
--output k8s/
Generated Kubernetes manifests:
# k8s/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: api-service
labels:
app: api
version: v1
spec:
replicas: 3
selector:
matchLabels:
app: api
template:
metadata:
labels:
app: api
version: v1
spec:
containers:
- name: api
image: registry.example.com/api:latest
ports:
- containerPort: 8000
env:
- name: DATABASE_URL
valueFrom:
secretKeyRef:
name: api-secrets
key: database-url
resources:
requests:
memory: "256Mi"
cpu: "250m"
limits:
memory: "512Mi"
cpu: "500m"
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /ready
port: 8000
initialDelaySeconds: 5
periodSeconds: 5
---
# k8s/service.yaml
apiVersion: v1
kind: Service
metadata:
name: api-service
spec:
selector:
app: api
ports:
- protocol: TCP
port: 80
targetPort: 8000
type: LoadBalancer
---
# k8s/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: api-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: api-service
minReplicas: 3
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
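The Deployment above reads DATABASE_URL from a Secret named api-secrets, which the generated manifests do not include. One way to create it out of band (a sketch, assuming kubectl access and a DATABASE_URL shell variable; the key name matches the secretKeyRef above):
# Create the Secret referenced by k8s/deployment.yaml
kubectl create secret generic api-secrets \
  --from-literal=database-url="$DATABASE_URL"
In practice, prefer sourcing this from an external secret store (Vault, AWS Secrets Manager) as listed under Environment Management rather than shell literals.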
Infrastructure Templates
AWS Infrastructure (Terraform)
# main.tf
terraform {
required_version = ">= 1.0"
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
}
backend "s3" {
bucket = "terraform-state-bucket"
key = "infrastructure/terraform.tfstate"
region = "us-east-1"
}
}
provider "aws" {
region = var.aws_region
default_tags {
tags = {
Project = var.project_name
Environment = var.environment
ManagedBy = "Terraform"
}
}
}
# VPC Module
module "vpc" {
source = "./modules/vpc"
vpc_cidr = var.vpc_cidr
availability_zones = var.availability_zones
public_subnet_cidrs = var.public_subnet_cidrs
private_subnet_cidrs = var.private_subnet_cidrs
}
# ECS Cluster
resource "aws_ecs_cluster" "main" {
name = "${var.project_name}-${var.environment}-cluster"
setting {
name = "containerInsights"
value = "enabled"
}
}
# Application Load Balancer
resource "aws_lb" "main" {
name = "${var.project_name}-${var.environment}-alb"
internal = false
load_balancer_type = "application"
security_groups = [aws_security_group.alb.id]
subnets = module.vpc.public_subnet_ids
enable_deletion_protection = var.environment == "production"
}
# RDS Database
resource "aws_db_instance" "main" {
identifier = "${var.project_name}-${var.environment}-db"
engine = "postgres"
engine_version = "15.3"
instance_class = var.db_instance_class
allocated_storage = var.db_allocated_storage
max_allocated_storage = var.db_max_allocated_storage
storage_encrypted = true
db_name = var.db_name
username = var.db_username
password = random_password.db_password.result
vpc_security_group_ids = [aws_security_group.db.id]
db_subnet_group_name = aws_db_subnet_group.main.name
backup_retention_period = var.environment == "production" ? 30 : 7
skip_final_snapshot = var.environment != "production"
tags = {
Name = "${var.project_name}-${var.environment}-db"
}
}
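The RDS resource above references random_password.db_password, which is not declared in this snippet. A minimal sketch, assuming the hashicorp/random provider is added to required_providers and AWS Secrets Manager is used so the application can retrieve the credential:
# secrets.tf (sketch)
resource "random_password" "db_password" {
  length  = 32
  special = false # avoid characters RDS rejects
}

resource "aws_secretsmanager_secret" "db_password" {
  name = "${var.project_name}/${var.environment}/db-password"
}

resource "aws_secretsmanager_secret_version" "db_password" {
  secret_id     = aws_secretsmanager_secret.db_password.id
  secret_string = random_password.db_password.result
}
Note that the generated password still lands in Terraform state, so the S3 backend should be encrypted and access-controlled.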
Docker Configuration
# Dockerfile
FROM python:3.11-slim as base
WORKDIR /app
# Install system dependencies
RUN apt-get update && apt-get install -y \
gcc \
libpq-dev \
&& rm -rf /var/lib/apt/lists/*
# Copy requirements
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy application code
COPY . .
# Create non-root user
RUN useradd -m -u 1000 appuser && chown -R appuser:appuser /app
USER appuser
# Expose port
EXPOSE 8000
# Health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=40s --retries=3 \
CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"
# Run application
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
# Production stage: extend the base image with a production application server
FROM base as production
ENV ENVIRONMENT=production
RUN pip install --no-cache-dir gunicorn
CMD ["gunicorn", "main:app", "-w", "4", "-k", "uvicorn.workers.UvicornWorker", "--bind", "0.0.0.0:8000"]
Docker Compose (Local Development)
# docker-compose.yml
version: '3.8'
services:
api:
build:
context: .
dockerfile: Dockerfile
target: base
ports:
- "8000:8000"
environment:
- DATABASE_URL=postgresql://user:password@db:5432/appdb
- REDIS_URL=redis://redis:6379/0
- ENVIRONMENT=development
volumes:
- .:/app
depends_on:
db:
condition: service_healthy
redis:
condition: service_started
command: uvicorn main:app --host 0.0.0.0 --port 8000 --reload
db:
image: postgres:15
environment:
POSTGRES_USER: user
POSTGRES_PASSWORD: password
POSTGRES_DB: appdb
ports:
- "5432:5432"
volumes:
- postgres_data:/var/lib/postgresql/data
healthcheck:
test: ["CMD-SHELL", "pg_isready -U user"]
interval: 10s
timeout: 5s
retries: 5
redis:
image: redis:7-alpine
ports:
- "6379:6379"
volumes:
- redis_data:/data
nginx:
image: nginx:alpine
ports:
- "80:80"
volumes:
- ./nginx.conf:/etc/nginx/nginx.conf:ro
depends_on:
- api
volumes:
postgres_data:
redis_data:
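The nginx service mounts ./nginx.conf, which is not shown above. A minimal reverse-proxy sketch for local development (assumed content; the upstream points at the api service on the Compose network):
# nginx.conf (local development sketch)
events {}

http {
  upstream api {
    server api:8000;
  }

  server {
    listen 80;

    location / {
      proxy_pass http://api;
      proxy_set_header Host $host;
      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
  }
}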
Deployment Strategies
Blue-Green Deployment
#!/bin/bash
# deploy-blue-green.sh
set -e
BLUE_ENV="production-blue"
GREEN_ENV="production-green"
CURRENT_ENV=$(get_active_environment)
if [ "$CURRENT_ENV" == "$BLUE_ENV" ]; then
TARGET_ENV="$GREEN_ENV"
OLD_ENV="$BLUE_ENV"
else
TARGET_ENV="$BLUE_ENV"
OLD_ENV="$GREEN_ENV"
fi
echo "Deploying to $TARGET_ENV (current: $OLD_ENV)"
# Deploy to target environment
deploy_to_environment "$TARGET_ENV"
# Run smoke tests
if ! run_smoke_tests "$TARGET_ENV"; then
echo "Smoke tests failed, rolling back"
exit 1
fi
# Switch traffic
switch_load_balancer "$TARGET_ENV"
# Monitor for 5 minutes
monitor_environment "$TARGET_ENV" 300
# If all good, keep old environment for quick rollback
echo "Deployment successful. Old environment $OLD_ENV kept for rollback."
Canary Deployment
# k8s/canary-deployment.yaml
apiVersion: v1
kind: Service
metadata:
name: api-service
spec:
selector:
app: api
ports:
- port: 80
targetPort: 8000
---
# Stable version (90% traffic)
apiVersion: apps/v1
kind: Deployment
metadata:
name: api-stable
spec:
replicas: 9
selector:
matchLabels:
app: api
version: stable
template:
metadata:
labels:
app: api
version: stable
spec:
containers:
- name: api
image: registry.example.com/api:v1.0.0
---
# Canary version (10% traffic)
apiVersion: apps/v1
kind: Deployment
metadata:
name: api-canary
spec:
replicas: 1
selector:
matchLabels:
app: api
version: canary
template:
metadata:
labels:
app: api
version: canary
spec:
containers:
- name: api
image: registry.example.com/api:v1.1.0
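With this layout the traffic split is approximated by pod count (9 stable vs. 1 canary behind the same Service), not by a dedicated traffic-splitting layer. Promotion and rollback then reduce to plain kubectl operations (illustrative commands):
# Shift more traffic to the canary (~25% with 9 stable + 3 canary pods)
kubectl scale deployment/api-canary --replicas=3

# Promote: move stable to the new image, then retire the canary
kubectl set image deployment/api-stable api=registry.example.com/api:v1.1.0
kubectl scale deployment/api-canary --replicas=0

# Roll back: simply remove the canary pods
kubectl scale deployment/api-canary --replicas=0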
Monitoring & Observability
Prometheus Configuration
# prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'api-service'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_label_app]
action: keep
regex: api
- job_name: 'node-exporter'
static_configs:
- targets: ['node-exporter:9100']
alerting:
alertmanagers:
- static_configs:
- targets: ['alertmanager:9093']
rule_files:
- /etc/prometheus/alerts/*.yml
Alert Rules
# alerts/api-alerts.yml
groups:
- name: api-alerts
interval: 30s
rules:
- alert: HighErrorRate
expr: sum by (instance) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (instance) (rate(http_requests_total[5m])) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "High error rate detected"
description: "{{ $labels.instance }} has error rate {{ $value }}"
- alert: HighLatency
expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1
for: 10m
labels:
severity: warning
annotations:
summary: "High latency detected"
description: "95th percentile latency is {{ $value }}s"
- alert: PodDown
expr: up{job="api-service"} == 0
for: 2m
labels:
severity: critical
annotations:
summary: "Pod is down"
description: "{{ $labels.instance }} has been down for 2 minutes"
Security Configuration
Network Security
# security-groups.tf
# ALB Security Group
resource "aws_security_group" "alb" {
name_prefix = "${var.project_name}-alb-"
vpc_id = module.vpc.vpc_id
ingress {
from_port = 443
to_port = 443
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
description = "HTTPS from internet"
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
}
# Application Security Group
resource "aws_security_group" "app" {
name_prefix = "${var.project_name}-app-"
vpc_id = module.vpc.vpc_id
ingress {
from_port = 8000
to_port = 8000
protocol = "tcp"
security_groups = [aws_security_group.alb.id]
description = "From ALB"
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
}
# Database Security Group
resource "aws_security_group" "db" {
name_prefix = "${var.project_name}-db-"
vpc_id = module.vpc.vpc_id
ingress {
from_port = 5432
to_port = 5432
protocol = "tcp"
security_groups = [aws_security_group.app.id]
description = "From application"
}
}
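The rules above cover network-level controls; the IAM side of "Access control: IAM and RBAC" is not generated here. A minimal sketch of an ECS task execution role for the cluster defined earlier, using only the AWS-managed execution policy:
# iam.tf (sketch)
data "aws_iam_policy_document" "ecs_tasks_assume" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["ecs-tasks.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "task_execution" {
  name               = "${var.project_name}-${var.environment}-task-execution"
  assume_role_policy = data.aws_iam_policy_document.ecs_tasks_assume.json
}

resource "aws_iam_role_policy_attachment" "task_execution" {
  role       = aws_iam_role.task_execution.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}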
Tool Access
Required tools:
- Read: Read SPEC documents and ADRs
- Write: Generate infrastructure and pipeline files
- Bash: Execute Terraform, Docker, kubectl commands
- Grep: Search for configuration patterns
Required software:
- Terraform / OpenTofu
- Docker / Podman
- kubectl / helm
- aws-cli / gcloud / az-cli
- Ansible (optional)
Integration Points
With doc-flow
- Extract infrastructure requirements from SPEC documents
- Validate ADR compliance in infrastructure code
- Generate deployment documentation
With security-audit
- Security scanning of infrastructure code
- Vulnerability assessment of containers
- Compliance validation
With test-automation
- Integration with CI/CD for automated testing
- Deployment smoke tests
- Infrastructure validation tests
With analytics-flow
- Deployment metrics and trends
- Infrastructure cost tracking
- Performance monitoring integration
Best Practices
- Infrastructure as Code: All infrastructure versioned in Git
- Immutable infrastructure: Replace, don't modify
- Environment parity: Dev/staging/prod consistency
- Secret management: Never commit secrets
- Monitoring from day one: Observability built-in
- Automated rollbacks: Fast failure recovery
- Cost optimization: Tag resources, monitor spending
- Security by default: Least privilege, encryption
- Documentation: Runbooks for common operations
- Disaster recovery: Regular backup testing
Success Criteria
- Zero manual infrastructure provisioning
- Deployment time < 15 minutes
- Rollback time < 5 minutes
- Zero-downtime deployments
- Infrastructure drift detection automated (see the sketch after this list)
- Security compliance checks: 100% pass rate
- Cost variance < 10% from budget
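Drift detection can be automated with a scheduled pipeline that fails when terraform plan reports changes. A GitHub Actions sketch; the schedule, working directory, and use of hashicorp/setup-terraform are assumptions, and cloud credentials must be available to the job (see Notes):
# .github/workflows/drift-check.yml (sketch)
name: Drift Check
on:
  schedule:
    - cron: '0 6 * * *' # daily at 06:00 UTC
jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: hashicorp/setup-terraform@v3
      - name: Detect drift
        working-directory: infrastructure
        run: |
          terraform init -input=false
          # -detailed-exitcode: 0 = no changes, 2 = drift detected (fails the job)
          terraform plan -detailed-exitcode -input=false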
Notes
- Generated configurations require review before production use
- Cloud provider credentials must be configured separately
- State management (Terraform) requires backend configuration
- Multi-region deployments require additional configuration
- Cost estimation requires tooling layered on terraform plan output (for example Terraform Cloud cost estimation or Infracost)