Capacity Planning Workflow
This skill defines the structured process for infrastructure capacity planning. Use it for proactive capacity management and growth forecasting.
Capacity Planning Phases
| Phase |
Focus |
Output |
| 1. Current State |
Document existing capacity |
Capacity baseline |
| 2. Usage Analysis |
Analyze utilization patterns |
Utilization report |
| 3. Growth Forecast |
Project future requirements |
Growth model |
| 4. Gap Analysis |
Identify capacity gaps |
Gap report |
| 5. Recommendations |
Scaling strategy |
Capacity plan |
| 6. Implementation |
Execute capacity changes |
Updated infrastructure |
Phase 1: Current State Assessment
Data Collection
Gather the following for each service tier:
| Metric |
Compute |
Database |
Storage |
Network |
| Provisioned |
Instance count/size |
Instance class |
Total GB |
Bandwidth |
| Peak utilization |
CPU/Memory % |
Connections/IOPS |
Usage % |
Throughput |
| Average utilization |
CPU/Memory % |
Connections/IOPS |
Growth rate |
Latency |
| Cost |
Monthly $ |
Monthly $ |
Monthly $ |
Monthly $ |
Data Sources
# AWS CLI examples
aws cloudwatch get-metric-statistics --namespace AWS/EC2 --metric-name CPUUtilization
aws rds describe-db-instances
aws s3api list-buckets
aws ce get-cost-and-usage
Current State Template
## Current Capacity Baseline
**Assessment Date:** YYYY-MM-DD
**Scope:** [production/staging/all]
### Compute Resources
| Service | Instance Type | Count | Avg CPU | Avg Memory | Cost/Month |
|---------|--------------|-------|---------|------------|------------|
| api | m5.xlarge | 10 | 45% | 60% | $2,400 |
| worker | c5.2xlarge | 5 | 70% | 40% | $1,800 |
### Database Resources
| Database | Instance Class | Storage | Avg Connections | Avg IOPS | Cost/Month |
|----------|---------------|---------|-----------------|----------|------------|
| primary | db.r5.2xlarge | 500GB | 150 | 5000 | $1,800 |
### Storage Resources
| Bucket/Volume | Type | Size | Growth Rate | Cost/Month |
|---------------|------|------|-------------|------------|
| logs | S3 Standard | 2TB | 100GB/month | $46 |
Phase 2: Usage Analysis
Utilization Patterns
Identify patterns in resource usage:
| Pattern |
Description |
Scaling Strategy |
| Steady |
Consistent load |
Reserved capacity |
| Cyclical |
Predictable peaks |
Scheduled scaling |
| Spiky |
Unpredictable bursts |
Auto-scaling |
| Growing |
Steady increase |
Proactive provisioning |
Analysis Questions
- What is peak vs average utilization?
- When do peaks occur? (time of day, day of week)
- What triggers traffic spikes? (campaigns, events)
- What is the headroom at peak? (safety margin)
- Are there correlated resources? (if A scales, B must scale)
Utilization Thresholds
| Metric |
Healthy |
Warning |
Critical |
| CPU |
<70% |
70-85% |
>85% |
| Memory |
<75% |
75-90% |
>90% |
| Storage |
<70% |
70-85% |
>85% |
| DB Connections |
<70% |
70-85% |
>85% |
Phase 3: Growth Forecasting
Forecasting Methods
| Method |
Best For |
Accuracy |
| Linear extrapolation |
Steady growth |
Moderate |
| Seasonal decomposition |
Cyclical patterns |
High |
| Business-driven |
New product launches |
Varies |
| Historical comparison |
Similar past events |
Moderate |
Growth Forecast Template
## Growth Forecast
**Forecast Period:** [Q1 2024 / 6 months / etc.]
**Methodology:** [method used]
**Confidence:** [High/Medium/Low]
### Traffic Projections
| Metric | Current | +3 Months | +6 Months | +12 Months |
|--------|---------|-----------|-----------|------------|
| Requests/sec | 1,000 | 1,200 | 1,500 | 2,000 |
| DAU | 50,000 | 60,000 | 75,000 | 100,000 |
| Data volume | 500GB | 600GB | 750GB | 1TB |
### Key Assumptions
1. [Assumption 1 - e.g., no major product launches]
2. [Assumption 2 - e.g., 20% YoY growth continues]
3. [Assumption 3 - e.g., no seasonal events]
### Risk Factors
| Factor | Impact | Likelihood | Mitigation |
|--------|--------|------------|------------|
| Viral growth | +200% traffic | Low | Auto-scaling limits |
| Marketing campaign | +50% traffic | Medium | Pre-scale before launch |
Phase 4: Gap Analysis
Capacity Gap Identification
Compare current capacity against forecast requirements:
## Gap Analysis
### Compute Gaps
| Service | Current Capacity | Needed (+6mo) | Gap | Severity |
|---------|------------------|---------------|-----|----------|
| api | 10 x m5.xlarge | 15 x m5.xlarge | +5 | Medium |
| worker | 5 x c5.2xlarge | 8 x c5.2xlarge | +3 | High |
### Database Gaps
| Database | Current | Needed | Gap | Notes |
|----------|---------|--------|-----|-------|
| primary | db.r5.2xlarge | db.r5.4xlarge | Upgrade | Vertical scale |
| replica | 1 replica | 2 replicas | +1 | Read scaling |
### Storage Gaps
| Storage | Current | Needed (+6mo) | Gap |
|---------|---------|---------------|-----|
| logs | 2TB | 3.6TB | +1.6TB |
| backups | 1TB | 1.5TB | +0.5TB |
Gap Severity Matrix
| Severity |
Criteria |
Action Timeline |
| Critical |
<2 weeks to capacity |
Immediate |
| High |
2-4 weeks to capacity |
This sprint |
| Medium |
1-3 months to capacity |
This quarter |
| Low |
>3 months to capacity |
Next quarter |
Phase 5: Recommendations
Scaling Strategy Options
| Strategy |
Best For |
Lead Time |
Cost Impact |
| Vertical |
DB, stateful |
Hours-days |
Immediate increase |
| Horizontal |
Stateless compute |
Minutes |
Linear increase |
| Reserved |
Predictable load |
Immediate |
30-70% savings |
| Spot |
Batch workloads |
Variable |
60-90% savings |
| Auto-scaling |
Variable load |
Real-time |
Pay for use |
Recommendation Template
## Capacity Recommendations
### Immediate Actions (This Sprint)
| Resource | Action | Effort | Cost Impact |
|----------|--------|--------|-------------|
| api ASG | Increase max from 10 to 15 | Low | +$600/mo max |
| worker ASG | Add 3 instances | Low | +$1,080/mo |
### Short-term Actions (This Quarter)
| Resource | Action | Effort | Cost Impact |
|----------|--------|--------|-------------|
| primary DB | Upgrade to r5.4xlarge | Medium | +$900/mo |
| Add read replica | Provision in us-east-1b | Medium | +$900/mo |
### Long-term Considerations (Next Quarter)
| Consideration | Rationale | Next Step |
|---------------|-----------|-----------|
| Sharding strategy | Single DB approaching limits | Architecture review |
| Multi-region | DR + latency benefits | Infrastructure-architect review |
### Cost Summary
| Timeframe | Current | Recommended | Delta |
|-----------|---------|-------------|-------|
| Monthly | $8,000 | $10,980 | +$2,980 |
| Annual | $96,000 | $131,760 | +$35,760 |
Phase 6: Implementation
Implementation Checklist
Post-Implementation Verification
Anti-Rationalization Table
| Rationalization |
Why It's WRONG |
Required Action |
| "We'll scale when we need to" |
Reactive scaling causes outages |
Proactive capacity planning |
| "Auto-scaling handles everything" |
Auto-scaling has limits and lag |
Set appropriate limits |
| "Current capacity is fine" |
Fine today ≠ fine tomorrow |
Forecast growth |
| "Too expensive to over-provision" |
Outage cost > over-provisioning cost |
Maintain safety margin |
Dispatch Specialists
For capacity planning tasks, dispatch:
Task tool:
subagent_type: "infrastructure-architect"
model: "opus"
prompt: |
CAPACITY PLANNING: [scope]
CURRENT STATE: [baseline]
GROWTH FORECAST: [projection]
REQUEST: [specific analysis needed]
For cost analysis of capacity options:
Task tool:
subagent_type: "cloud-cost-optimizer"
model: "opus"
prompt: |
CAPACITY OPTIONS: [options to evaluate]
CONSTRAINTS: [budget, performance requirements]
REQUEST: Cost-benefit analysis