Cost Optimization Workflow
This skill defines the structured process for cloud cost optimization. Use it for systematic cost analysis and data-driven optimization.
Cost Optimization Phases
| Phase |
Focus |
Output |
| 1. Cost Visibility |
Understand current spend |
Cost breakdown |
| 2. Anomaly Detection |
Identify unusual spend |
Anomaly report |
| 3. Optimization Analysis |
Find savings opportunities |
Opportunities list |
| 4. Risk Assessment |
Evaluate optimization risks |
Risk matrix |
| 5. Implementation |
Execute optimizations |
Cost reduction |
| 6. Monitoring |
Track savings |
Savings report |
Phase 1: Cost Visibility
Cost Breakdown Dimensions
Analyze costs across multiple dimensions:
| Dimension |
Purpose |
Tool |
| Service |
Which AWS services cost most |
Cost Explorer |
| Account |
Which accounts spend most |
Cost Explorer |
| Tag |
Cost by team/project/environment |
Cost Allocation Tags |
| Resource |
Individual resource costs |
Cost Explorer |
| Time |
Cost trends over time |
Cost Explorer |
Cost Visibility Template
## Cost Visibility Report
**Period:** [Month YYYY]
**Total Spend:** $XX,XXX
**Budget:** $XX,XXX
**Variance:** [+/-X%]
### Cost by Service
| Service | Cost | % of Total | MoM Change |
|---------|------|------------|------------|
| EC2 | $X,XXX | XX% | +X% |
| RDS | $X,XXX | XX% | +X% |
| S3 | $X,XXX | XX% | +X% |
| Data Transfer | $X,XXX | XX% | +X% |
| Other | $X,XXX | XX% | +X% |
### Cost by Environment
| Environment | Cost | % of Total |
|-------------|------|------------|
| Production | $X,XXX | XX% |
| Staging | $X,XXX | XX% |
| Development | $X,XXX | XX% |
### Cost by Team
| Team | Cost | % of Total |
|------|------|------------|
| Platform | $X,XXX | XX% |
| API | $X,XXX | XX% |
| Data | $X,XXX | XX% |
Tagging Requirements
Minimum required tags for cost allocation:
| Tag |
Purpose |
Example |
Environment |
Env separation |
prod, staging, dev |
Team |
Cost ownership |
platform, api, data |
Service |
Service identification |
api-gateway, auth |
CostCenter |
Financial allocation |
CC-1234 |
Phase 2: Anomaly Detection
Anomaly Detection Rules
| Rule |
Threshold |
Alert |
| Daily spend spike |
>20% vs 7-day avg |
Warning |
| Service cost jump |
>50% vs last month |
Critical |
| New service appears |
Any new service >$100/day |
Info |
| Tag coverage drop |
<95% coverage |
Warning |
Anomaly Investigation
When anomaly detected:
Identify the spike:
- Which service/resource?
- When did it start?
- What changed?
Check common causes:
- New deployment
- Traffic increase
- Data growth
- Misconfiguration
- Forgotten resources
Validate intentionality:
- Expected growth?
- Approved change?
- One-time vs recurring?
Anomaly Report Template
## Cost Anomaly Report
**Detected:** YYYY-MM-DD HH:MM
**Severity:** [Critical/Warning/Info]
### Anomaly Details
| Metric | Expected | Actual | Delta |
|--------|----------|--------|-------|
| Daily spend | $X,XXX | $X,XXX | +XX% |
### Investigation
**Root Cause:** [description]
**Contributing Factors:**
1. [Factor 1]
2. [Factor 2]
**Intentional:** [Yes/No]
### Action Required
- [ ] [Action if remediation needed]
- [ ] [Update budget if expected]
Phase 3: Optimization Analysis
Optimization Categories
| Category |
Typical Savings |
Effort |
Risk |
| Rightsizing |
20-40% |
Low |
Low |
| Reserved Capacity |
30-70% |
Medium |
Low-Medium |
| Spot Instances |
60-90% |
Medium |
Medium |
| Storage Tiering |
20-50% |
Low |
Low |
| Idle Resources |
100% |
Low |
None |
| Data Transfer |
10-30% |
Medium |
Low |
Rightsizing Analysis
## Rightsizing Opportunities
### Underutilized Instances
| Instance | Type | Avg CPU | Avg Mem | Recommendation | Savings |
|----------|------|---------|---------|----------------|---------|
| api-prod-1 | m5.xlarge | 15% | 25% | m5.large | $70/mo |
| worker-2 | c5.2xlarge | 30% | 20% | c5.xlarge | $140/mo |
### Criteria Used
- CPU avg <40% over 14 days -> downsize candidate
- Memory avg <50% over 14 days -> downsize candidate
- Excluded: ASG instances (handled by ASG sizing)
Reserved Instance Analysis
## Reserved Instance Coverage
### Current Coverage
| Service | On-Demand | Reserved | Coverage |
|---------|-----------|----------|----------|
| EC2 | $5,000 | $3,000 | 38% |
| RDS | $2,000 | $0 | 0% |
| ElastiCache | $500 | $500 | 50% |
### RI Recommendations
| Resource Type | Term | Payment | Monthly Savings | Break-even |
|---------------|------|---------|-----------------|------------|
| 10x m5.large | 1 year | No upfront | $350 | 0 months |
| db.r5.xlarge | 1 year | Partial | $180 | 4 months |
### RI Purchase Criteria
- Stable workload for >80% of term
- Usage predictable for commitment period
- Consider convertible RIs for flexibility
Idle Resource Detection
## Idle Resources
### Unattached EBS Volumes
| Volume ID | Size | Cost/Month | Last Attached |
|-----------|------|------------|---------------|
| vol-xxx | 100GB | $10 | 90 days ago |
| vol-yyy | 500GB | $50 | Never |
### Unused Elastic IPs
| IP | Allocation ID | Associated | Cost/Month |
|----|---------------|------------|------------|
| x.x.x.x | eipalloc-xxx | No | $3.60 |
### Idle Load Balancers
| LB Name | Target Groups | Requests/Day | Cost/Month |
|---------|---------------|--------------|------------|
| old-api | 0 | 0 | $16.50 |
Phase 4: Risk Assessment
Optimization Risk Matrix
| Optimization |
Risk Level |
Potential Impact |
Mitigation |
| Downsize instance |
Low |
Performance degradation |
Monitor, quick rollback |
| Purchase RI |
Low-Medium |
Unused commitment |
Convertible RIs |
| Spot instances |
Medium |
Instance interruption |
Diversify, checkpointing |
| Delete idle |
None-Low |
Lost data (if EBS) |
Snapshot first |
| Storage tiering |
Low |
Retrieval latency |
Test access patterns |
Risk Assessment Checklist
Phase 5: Implementation
Implementation Priority
| Priority |
Criteria |
Examples |
| Quick Wins |
Low effort, no risk, immediate savings |
Delete idle resources |
| High Impact |
Significant savings, manageable risk |
RI purchases |
| Medium Impact |
Moderate savings, requires planning |
Rightsizing |
| Long-term |
Architectural changes |
Spot migration |
Implementation Checklist
Phase 6: Monitoring
Savings Tracking
## Savings Report
**Period:** [Month YYYY]
**Target Savings:** $X,XXX
**Actual Savings:** $X,XXX
**Achievement:** XX%
### Savings by Category
| Category | Target | Actual | Status |
|----------|--------|--------|--------|
| Rightsizing | $500 | $450 | 90% |
| Reserved Instances | $2,000 | $2,100 | 105% |
| Idle Resources | $200 | $200 | 100% |
### Monthly Trend
| Month | Spend | Savings | Cumulative |
|-------|-------|---------|------------|
| Jan | $50,000 | $0 | $0 |
| Feb | $48,000 | $2,000 | $2,000 |
| Mar | $47,500 | $2,500 | $4,500 |
Anti-Rationalization Table
| Rationalization |
Why It's WRONG |
Required Action |
| "Small savings not worth it" |
Small savings compound |
Evaluate ALL opportunities |
| "RIs are too risky" |
RI risk is manageable |
Analyze stable workloads |
| "Dev doesn't need optimization" |
Dev is often 30%+ of cost |
Optimize ALL environments |
| "Can't predict future usage" |
Historical data helps |
Use data-driven forecasting |
| "Optimization takes too much time" |
ROI on optimization is high |
Invest in systematic process |
Pressure Resistance
| User Says |
Your Response |
| "Just cut costs by 30%" |
"Cannot proceed without analysis. Blind cuts cause outages. Will provide data-driven recommendations." |
| "Skip the analysis, buy RIs" |
"RI purchases require usage analysis. Wrong RIs waste money. Analysis required first." |
| "Dev environment is fine as-is" |
"Dev costs are significant. Optimization applies to all environments." |
Dispatch Specialist
For cost optimization tasks, dispatch:
Task tool:
subagent_type: "cloud-cost-optimizer"
model: "opus"
prompt: |
COST ANALYSIS REQUEST
Scope: [accounts/services to analyze]
Period: [time range]
Focus: [rightsizing/RI/general optimization]
Constraints: [budget targets, risk tolerance]