| name | ops-monitor |
| model | claude-haiku-4-5 |
| description | Monitor deployed infrastructure health and performance - check resource status, query CloudWatch metrics (CPU, memory, requests, errors), analyze performance trends, track SLI/SLO metrics, detect anomalies, generate health reports with resource status summaries, identify degraded services, provide performance optimization recommendations. |
| tools | Bash, Read, Write |
Operations Monitoring Skill
EXECUTE STEPS:
Step 1: Load Configuration and Registry
- Read: .fractary/plugins/faber-cloud/devops.json
- Read: .fractary/plugins/faber-cloud/deployments/${environment}/registry.json
- Extract: List of deployed resources to monitor
- Output: "✓ Found ${resource_count} resources to monitor"
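Step 1 can be sketched in Python as below. This is a minimal illustration, not the plugin's actual loader: the helper name `load_resources` and the assumption that `registry.json` holds a top-level `resources` list are hypothetical, since the registry schema is not shown here.

```python
import json
from pathlib import Path

PLUGIN_DIR = Path(".fractary/plugins/faber-cloud")


def load_resources(environment: str, base_dir: Path = PLUGIN_DIR) -> list:
    """Return the deployed resources recorded for one environment."""
    registry_path = base_dir / "deployments" / environment / "registry.json"
    if not registry_path.exists():
        # Mirrors the "Resource registry not found" failure condition.
        raise FileNotFoundError(f"Resource registry not found: {registry_path}")
    registry = json.loads(registry_path.read_text())
    # Assumption: resources are stored under a top-level "resources" list.
    return registry.get("resources", [])
```

Calling `load_resources("test")` would then give the list whose length feeds `${resource_count}`.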
Step 2: Determine Operation
- If operation == "health-check":
  - Read: workflow/health-check.md
  - Check status of all resources
- If operation == "performance-analysis":
  - Read: workflow/performance-analysis.md
  - Analyze metrics and trends
- If operation == "metrics-query":
  - Read: workflow/metrics-query.md
  - Query specific metrics
- Output: "✓ Operation determined: ${operation}"
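The branching in Step 2 amounts to a lookup from operation name to workflow file. A small sketch follows; the function name `resolve_workflow` is hypothetical, and rejecting unknown operations with an error is an assumption, because the steps do not say what should happen in that case.

```python
# Operation names and workflow files as listed in Step 2.
WORKFLOWS = {
    "health-check": "workflow/health-check.md",
    "performance-analysis": "workflow/performance-analysis.md",
    "metrics-query": "workflow/metrics-query.md",
}


def resolve_workflow(operation: str) -> str:
    """Return the workflow file for an operation."""
    try:
        return WORKFLOWS[operation]
    except KeyError:
        # Assumption: an unrecognized operation is treated as an error.
        valid = ", ".join(sorted(WORKFLOWS))
        raise ValueError(
            f"Unknown operation {operation!r}; expected one of: {valid}"
        ) from None
```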
Step 3: Execute Monitoring
- For each resource in scope:
  - Query resource status via handler
  - Query CloudWatch metrics
  - Analyze current state
  - Compare against thresholds
- Collect results for all resources
- Output: "✓ Monitoring completed for ${resource_count} resources"
Step 4: Analyze Results
- Read: workflow/analyze-health.md
- Categorize resources: healthy / degraded / unhealthy
- Identify patterns (multiple failures, related issues)
- Detect anomalies (unusual metrics, sudden changes)
- Output: "✓ Analysis complete"
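The anomaly check in Step 4 is not specified in detail. One simple way to flag a "sudden change" is a z-score against recent history, sketched below. The function name, the 3-sigma default, and the choice of a z-score at all are assumptions for illustration; workflow/analyze-health.md may define a different method.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, z_threshold=3.0):
    """Flag `latest` if it deviates strongly from the recent datapoints."""
    if len(history) < 2:
        # Not enough data to estimate a spread; do not flag anything.
        return False
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # Perfectly flat history: any change at all is unusual.
        return latest != baseline
    return abs(latest - baseline) / spread > z_threshold
```

For example, with a duration history of `[100, 102, 98, 101, 99]` ms, a new datapoint of 250 ms is flagged while 101 ms is not.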
Step 5: Generate Report
- Create monitoring report with:
  - Overall health status
  - Resource-by-resource status
  - Metrics summary
  - Issues found
  - Recommendations
- Save to: .fractary/plugins/faber-cloud/monitoring/${environment}/${timestamp}-${operation}.json
- Output: "✓ Report generated: ${report_path}"
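Saving the report in Step 5 could look like the sketch below. The helper name `save_report` is hypothetical, and the date-only `${timestamp}` format is an assumption taken from the sample path `2025-10-28-health-check.json`; a finer-grained timestamp may be preferable if several runs per day must be kept.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

PLUGIN_DIR = Path(".fractary/plugins/faber-cloud")


def save_report(report, environment, operation, base_dir=PLUGIN_DIR, now=None):
    """Write the report as JSON and return the path it was saved to."""
    now = now or datetime.now(timezone.utc)
    # Assumption: date-only timestamp, matching the sample report path.
    timestamp = now.strftime("%Y-%m-%d")
    report_path = (
        base_dir / "monitoring" / environment / f"{timestamp}-{operation}.json"
    )
    report_path.parent.mkdir(parents=True, exist_ok=True)
    report_path.write_text(json.dumps(report, indent=2))
    return report_path
```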
Step 6: Check Thresholds
- Compare metrics against configured thresholds
- Identify threshold violations
- Prioritize by severity
- Output: "✓ Threshold check complete"
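A minimal sketch of Step 6 follows. The input shapes are hypothetical: `metrics` maps resource name to metric values, and `thresholds` maps metric name to a limit and a severity. Only the `HIGH` severity appears in this skill's sample output, so the `MEDIUM` and `LOW` levels and their ordering are assumptions.

```python
# Assumption: three severity levels, most urgent first.
SEVERITY_ORDER = {"HIGH": 0, "MEDIUM": 1, "LOW": 2}


def find_violations(metrics, thresholds):
    """Return threshold violations, most severe first."""
    violations = []
    for resource, values in metrics.items():
        for metric, value in values.items():
            rule = thresholds.get(metric)
            if rule is not None and value > rule["limit"]:
                violations.append({
                    "severity": rule["severity"],
                    "resource": resource,
                    "metric": metric,
                    "current_value": value,
                    "threshold": rule["limit"],
                })
    # Unknown severities sort after all known ones.
    return sorted(
        violations,
        key=lambda v: (
            SEVERITY_ORDER.get(v["severity"], len(SEVERITY_ORDER)),
            v["resource"],
        ),
    )
```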
OUTPUT COMPLETION MESSAGE:
✅ COMPLETED: Operations Monitoring
Status: ${overall_health}
Resources Checked: ${total_count}
Healthy: ${healthy_count}
Degraded: ${degraded_count}
Unhealthy: ${unhealthy_count}
${issues_summary}
Report: ${report_path}
───────────────────────────────────────
${recommendations_summary}
IF ISSUES FOUND:
⚠️ COMPLETED: Operations Monitoring (Issues Found)
Status: ${overall_health}
Resources Checked: ${total_count}
Unhealthy: ${unhealthy_count}
Issues:
${issue_list}
Recommendations:
${recommendations}
───────────────────────────────────────
Next: Investigate issues with ops-investigator
IF FAILURE:
❌ FAILED: Operations Monitoring
Step: ${failed_step}
Error: ${error_message}
───────────────────────────────────────
Resolution: ${resolution_steps}
✅ 1. Resources Identified
- Resource registry loaded
- All resources in scope identified
- Resource types determined
✅ 2. Status Checked
- Resource status queried from AWS
- CloudWatch metrics collected
- Current state determined
✅ 3. Health Analyzed
- Resources categorized by health
- Issues identified and prioritized
- Patterns and anomalies detected
✅ 4. Report Generated
- Monitoring report created
- All findings documented
- Recommendations provided
✅ 5. Thresholds Evaluated
- Metrics compared to thresholds
- Violations identified
- Severity assessed
FAILURE CONDITIONS - Stop and report if:
- ❌ Cannot access CloudWatch (check AWS permissions)
- ❌ Resource registry not found (no deployments in environment)
- ❌ CloudWatch logs/metrics not available (check resource configuration)
PARTIAL COMPLETION - Not acceptable:
- ⚠️ Some resources not checked → Return to Step 3
- ⚠️ Report not generated → Return to Step 5
Monitoring Report
- Location: .fractary/plugins/faber-cloud/monitoring/${environment}/${timestamp}-${operation}.json
- Format: JSON with detailed findings
- Contains: Health status, metrics, issues, recommendations
Health Summary
- Overall status: HEALTHY / DEGRADED / UNHEALTHY
- Resource counts by status
- Critical issues list
- Priority recommendations
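The health summary can be derived from per-resource statuses roughly as follows. The roll-up rule is not stated in this skill, so "worst status wins" and treating UNKNOWN as DEGRADED overall are assumptions; the function name `summarize` is hypothetical.

```python
from collections import Counter


def summarize(statuses):
    """Roll per-resource statuses up into an overall health summary.

    `statuses` maps resource name to HEALTHY, DEGRADED, UNHEALTHY or UNKNOWN.
    """
    counts = Counter(statuses.values())
    # Assumption: the worst status present decides the overall status.
    if counts["UNHEALTHY"]:
        overall = "UNHEALTHY"
    elif counts["DEGRADED"] or counts["UNKNOWN"]:
        overall = "DEGRADED"
    else:
        overall = "HEALTHY"
    return {
        "overall_health": overall,
        "resources": {
            "total": len(statuses),
            "healthy": counts["HEALTHY"],
            "degraded": counts["DEGRADED"],
            "unhealthy": counts["UNHEALTHY"],
        },
    }
```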
Return to agent:
{
  "overall_health": "HEALTHY|DEGRADED|UNHEALTHY",
  "environment": "${environment}",
  "timestamp": "2025-10-28T...",
  "resources": {
    "total": 10,
    "healthy": 8,
    "degraded": 1,
    "unhealthy": 1
  },
  "issues": [
    {
      "severity": "HIGH",
      "resource": "api-lambda",
      "issue": "Error rate above threshold (5.2% > 1%)",
      "metric": "Errors",
      "current_value": "5.2%",
      "threshold": "1%"
    }
  ],
  "metrics_summary": {
    "api-lambda": {
      "invocations": 1250,
      "errors": 65,
      "error_rate": "5.2%",
      "duration_avg": "245ms",
      "throttles": 0
    }
  },
  "recommendations": [
    "Investigate api-lambda errors (5.2% error rate)",
    "Consider increasing Lambda memory (avg duration 245ms)",
    "Review database connection pooling"
  ],
  "report_path": ".fractary/plugins/faber-cloud/monitoring/test/2025-10-28-health-check.json"
}
**USE SKILL: handler-hosting-${hosting_handler}**
Operation: get-resource-status | query-metrics
Arguments: ${resource_id} ${metric_name} ${timeframe}
Reports are stored in:
- .fractary/plugins/faber-cloud/monitoring/${environment}/${timestamp}-${operation}.json
- Historical trends in monitoring-history.json
HEALTHY:
- Resource exists and is running
- All metrics within thresholds
- No errors or minimal error rate (<0.1%)
- Performance acceptable
DEGRADED:
- Resource exists and is running
- Some metrics approaching thresholds (above 80% of the threshold value)
- Elevated error rate (0.1% - 1%)
- Performance slightly degraded
UNHEALTHY:
- Resource doesn't exist or is stopped
- Metrics exceed thresholds
- High error rate (>1%)
- Performance severely degraded
- Resource in failed state
UNKNOWN:
- Cannot determine status
- Metrics not available
- CloudWatch access issues
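The error-rate bands in the definitions above can be expressed as a small classifier. This sketch covers only the error-rate criterion, not resource existence or the other metrics, and it assumes that a rate of exactly 1% still counts as DEGRADED and that a missing rate maps to UNKNOWN. The function name is hypothetical.

```python
from typing import Optional


def classify_error_rate(error_rate_pct: Optional[float]) -> str:
    """Map an error rate in percent to a health status."""
    if error_rate_pct is None:
        # Metrics not available.
        return "UNKNOWN"
    if error_rate_pct > 1.0:
        return "UNHEALTHY"
    if error_rate_pct >= 0.1:
        # Assumption: the 0.1% - 1% band includes both endpoints.
        return "DEGRADED"
    return "HEALTHY"
```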
Lambda:
- Invocations (count)
- Errors (count)
- Duration (ms)
- Throttles (count)
- ConcurrentExecutions (count)
- Error rate = Errors / Invocations * 100
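The error-rate formula is easy to get wrong when a function has had no invocations in the window. A short sketch, with a hypothetical helper name and the assumption that zero invocations yields no rate at all rather than 0%:

```python
def error_rate_pct(errors, invocations):
    """Error rate = Errors / Invocations * 100, or None without traffic."""
    if invocations == 0:
        # Assumption: no invocations means the rate is undefined, not 0%.
        return None
    return errors / invocations * 100
```

With the figures from the sample report (65 errors over 1250 invocations), this gives the 5.2% shown there.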
S3:
- BucketSizeBytes (bytes)
- NumberOfObjects (count)
- 4xxErrors (count)
- 5xxErrors (count)
RDS:
- CPUUtilization (percent)
- DatabaseConnections (count)
- FreeableMemory (bytes)
- ReadLatency (seconds)
- WriteLatency (seconds)
ECS:
- CPUUtilization (percent)
- MemoryUtilization (percent)
- RunningTaskCount (count; requires Container Insights)
- DesiredTaskCount (count; requires Container Insights)
API Gateway:
- Count (requests)
- 4XXError (count)
- 5XXError (count)
- Latency (ms)
- IntegrationLatency (ms)
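For orientation, the sketch below builds an `aws cloudwatch get-metric-statistics` command for some of the metrics listed above. It only assembles the argument list and does not run anything. This skill queries metrics through the hosting handler, so the helper `build_metric_command`, its parameters, and the choice of the AWS CLI are illustrative assumptions, not the handler's actual implementation. Only resource types with a single identifying dimension are included.

```python
from datetime import datetime, timedelta, timezone

# CloudWatch namespace and primary dimension per resource type.
NAMESPACES = {
    "lambda": ("AWS/Lambda", "FunctionName"),
    "rds": ("AWS/RDS", "DBInstanceIdentifier"),
    "apigateway": ("AWS/ApiGateway", "ApiName"),
}


def build_metric_command(resource_type, resource_id, metric_name,
                         statistic="Sum", minutes=60, period=300, now=None):
    """Return the AWS CLI argument list for one metric query."""
    namespace, dimension = NAMESPACES[resource_type]
    end = now or datetime.now(timezone.utc)
    start = end - timedelta(minutes=minutes)
    fmt = "%Y-%m-%dT%H:%M:%SZ"  # expects `now` to be in UTC
    return [
        "aws", "cloudwatch", "get-metric-statistics",
        "--namespace", namespace,
        "--metric-name", metric_name,
        "--dimensions", f"Name={dimension},Value={resource_id}",
        "--start-time", start.strftime(fmt),
        "--end-time", end.strftime(fmt),
        "--period", str(period),
        "--statistics", statistic,
        "--output", "json",
    ]
```

The resulting list can be passed to `subprocess.run` or joined into a shell command for the Bash tool.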