| name | cloud-infrastructure |
| description | Cloud infrastructure design and deployment patterns for AWS, Azure, and GCP. Use when designing cloud architectures, implementing IaC with Terraform, optimizing costs, or setting up multi-region deployments. |
| author | Joseph OBrien |
| status | unpublished |
| updated | 2025-12-23 |
| version | 1.0.1 |
| tag | skill |
| type | skill |
Cloud Infrastructure
Comprehensive cloud infrastructure skill covering multi-cloud architecture, Infrastructure as Code, cost optimization, and production deployment patterns.
When to Use This Skill
- Designing cloud architecture for new applications
- Implementing Infrastructure as Code (Terraform, CloudFormation, Pulumi)
- Cost optimization and resource right-sizing
- Multi-region and high-availability deployments
- Cloud migration planning
- Security and compliance implementation
- Auto-scaling and performance optimization
Cloud Architecture Patterns
Compute Patterns
| Pattern | AWS | Azure | GCP | Use Case |
|---|---|---|---|---|
| Serverless | Lambda | Functions | Cloud Functions | Event-driven, variable load |
| Containers | ECS/EKS | AKS | GKE | Microservices, consistent env |
| VMs | EC2 | Virtual Machines | Compute Engine | Legacy apps, full control |
| Batch | Batch | Batch | Batch | Large-scale processing |
Storage Patterns
| Type | AWS | Azure | GCP | Use Case |
|---|---|---|---|---|
| Object | S3 | Blob Storage | Cloud Storage | Static files, backups |
| Block | EBS | Managed Disks | Persistent Disk | Database storage |
| File | EFS | Azure Files | Filestore | Shared file systems |
| Archive | Glacier | Archive | Coldline | Long-term retention |
Database Patterns
| Type | AWS | Azure | GCP | Use Case |
|---|---|---|---|---|
| Relational | RDS, Aurora | SQL Database | Cloud SQL | ACID transactions |
| NoSQL | DynamoDB | Cosmos DB | Firestore | Flexible schema |
| Cache | ElastiCache | Cache for Redis | Memorystore | Session, caching |
| Data Warehouse | Redshift | Synapse | BigQuery | Analytics |
Infrastructure as Code
Terraform Best Practices
Project Structure:
infrastructure/
├── modules/
│ ├── networking/
│ ├── compute/
│ └── database/
├── environments/
│ ├── dev/
│ ├── staging/
│ └── prod/
├── main.tf
├── variables.tf
├── outputs.tf
└── versions.tf
State Management:
- Use remote state (S3, Azure Blob, GCS)
- Enable state locking (DynamoDB, Blob lease)
- Separate state per environment
- Never commit state files
Module Design:
- Single responsibility per module
- Expose minimal required variables
- Document inputs/outputs
- Version modules with git tags
Cost Optimization
Compute Savings:
- Reserved Instances (1-3 year commitment): 30-60% savings
- Spot/Preemptible instances: 60-90% savings for interruptible workloads
- Right-sizing: Match instance size to actual usage
- Auto-scaling: Scale down during low usage
Storage Savings:
- Lifecycle policies: Auto-transition to cheaper tiers
- Compression: Reduce storage footprint
- Deduplication: Eliminate redundant data
- Delete unused resources: Orphaned volumes, snapshots
Network Savings:
- Use CDN for static content
- Optimize data transfer paths
- Use private endpoints
- Compress API responses
High Availability Patterns
Multi-AZ Deployment
- Deploy across 2-3 availability zones
- Use load balancers for distribution
- Database replication across AZs
- Automatic failover configuration
Multi-Region Deployment
- Active-active or active-passive
- DNS-based routing (Route53, Traffic Manager)
- Data replication strategy
- Disaster recovery procedures
Resilience Patterns
- Circuit breakers for external dependencies
- Retry with exponential backoff
- Bulkhead isolation
- Graceful degradation
Security Best Practices
Identity & Access
- Principle of least privilege
- Use IAM roles, not long-term credentials
- Enable MFA for privileged accounts
- Regular access reviews
Network Security
- VPC/VNet isolation
- Security groups as firewalls
- Private subnets for backend services
- VPN/Direct Connect for hybrid
Data Protection
- Encryption at rest (KMS)
- Encryption in transit (TLS)
- Key rotation policies
- Backup and recovery testing
Monitoring & Observability
Key Metrics
- CPU, Memory, Disk utilization
- Network throughput and latency
- Error rates and types
- Cost per service/team
Alerting Strategy
- Set thresholds based on baselines
- Alert on symptoms, not causes
- Runbooks for each alert
- Escalation paths defined
Reference Files
references/terraform_patterns.md- IaC patterns and examplesreferences/cost_optimization.md- Detailed cost reduction strategies
Integration with Other Skills
- security-engineering - For security architecture
- network-engineering - For network design
- performance - For optimization strategies
- devops-runbooks - For operational procedures