| name | monitoring-skill |
| description | Monitoring and observability with Prometheus, Grafana, ELK Stack, and distributed tracing. |
| sasmp_version | 1.3.0 |
| bonded_agent | 06-monitoring-observability |
| bond_type | PRIMARY_BOND |
Monitoring & Observability Skill
Overview
Master the three pillars of observability: metrics, logs, and traces.
Parameters
| Name | Type | Required | Default | Description |
| pillar | string | No | all | Observability pillar |
| tool | string | No | prometheus | Tool focus |
Core Topics
MANDATORY
- Prometheus metrics and PromQL
- Grafana dashboards
- ELK Stack basics
- SLIs, SLOs, error budgets
- Alerting rules
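The SLO and error budget arithmetic listed above can be sketched in a few lines. This is a minimal illustration rather than a prescribed method: the function names `error_budget_minutes` and `budget_remaining`, the 30-day window, and the request counts are all assumptions chosen for the example.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over a window.

    slo_target is a fraction such as 0.999 (hypothetical example value).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent.

    Returns a negative number when the budget is overspent.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # A 99.9% SLO over 30 days leaves roughly 43 minutes of downtime.
    print(round(error_budget_minutes(0.999, 30), 1))
    # 250 failures out of 1,000,000 requests spends about a quarter of the budget.
    print(round(budget_remaining(0.999, 1_000_000, 250), 2))
```

In practice the request counts would come from an SLI query such as the error-rate expression in the Quick Reference; here they are plain numbers so the arithmetic stays visible.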
OPTIONAL
- Distributed tracing
- OpenTelemetry
- Custom exporters
- Log correlation
ADVANCED
- High cardinality handling
- Recording rules
- Federation
- Continuous profiling
Quick Reference
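# PromQL
# Request rate per service over the last 5 minutes
sum(rate(http_requests_total[5m])) by (service)
# 99th percentile latency, aggregated across instances (keep the le label)
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
# Percentage of requests returning 5xx
100 * sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# Prometheus API
# List scrape targets and their health
curl http://localhost:9090/api/v1/targets
# Run an instant query
curl 'http://localhost:9090/api/v1/query?query=up'
# Reload configuration (requires --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload
# Alertmanager
# Silence an alert for two hours (amtool needs --alertmanager.url or a config file)
amtool silence add alertname="HighLatency" --duration=2h
# List currently firing alerts
amtool alert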
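The same instant query can be issued from a script. The sketch below uses only the Python standard library and assumes a Prometheus server at `http://localhost:9090`, as in the curl examples; the helper names `build_query_url`, `parse_instant_vector`, and `instant_query` are illustrative and not part of any Prometheus client.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url: str, promql: str) -> str:
    """Build an /api/v1/query URL with the PromQL expression URL-encoded."""
    query = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def parse_instant_vector(payload: dict) -> list:
    """Turn an instant-vector response into (labels, value) pairs."""
    if payload.get("status") != "success":
        raise RuntimeError(payload.get("error", "query failed"))
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each sample is {"metric": {...}, "value": [timestamp, "string value"]}.
    return [(s["metric"], float(s["value"][1])) for s in data["result"]]


def instant_query(base_url: str, promql: str, timeout: float = 5.0) -> list:
    """Run an instant query against a live server (needs network access)."""
    url = build_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_instant_vector(json.load(resp))


if __name__ == "__main__":
    for labels, value in instant_query("http://localhost:9090", "up"):
        print(labels.get("job"), labels.get("instance"), value)
```

Splitting URL construction and response parsing from the network call keeps the first two testable without a running server.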
SRE Golden Signals
| Signal | Metric |
| Latency | histogram_quantile(0.99, ...) |
| Traffic | sum(rate(requests_total[5m])) |
| Errors | rate(errors_total[5m]) |
| Saturation | node_memory_MemAvailable_bytes |
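To make the latency row less opaque, the sketch below shows how a quantile can be estimated from cumulative histogram buckets by linear interpolation. It is a simplified illustration of the idea behind `histogram_quantile`, not Prometheus's actual implementation: it assumes non-negative bucket bounds (such as latencies), requires a `+Inf` bucket, and ignores native histograms. The bucket values are made up.

```python
import math


def estimate_quantile(q: float, buckets: list) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) buckets.

    Simplified sketch: assumes non-negative bounds and a +Inf bucket.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must include a +Inf upper bound")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            if count == prev_count:
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


if __name__ == "__main__":
    # Hypothetical request-duration buckets in seconds.
    sample = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
    print(estimate_quantile(0.95, sample))
```

The estimate can never be more precise than the bucket layout allows, which is why bucket boundaries should bracket the latency targets that matter.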
Troubleshooting
Common Failures
| Symptom | Root Cause | Solution |
| No data | Scrape failing | Check the /targets page for scrape errors |
| Alert not firing | PromQL error | Test the expression in the Prometheus UI |
| High cardinality | Too many labels | Drop or relabel high-cardinality labels |
| Slow queries | Too much data | Add aggregation or narrow the time range |
Debug Checklist
- Check targets: /targets
- Test query in UI
- Check logs: journalctl -u prometheus
- Verify time sync (NTP)
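The first checklist item can also be automated. The sketch below filters a `/api/v1/targets` response for targets whose health is not `up`; the function name `unhealthy_targets` and the sample payload are illustrative, and fetching the JSON (for example with curl) is left out so the logic runs without a server.

```python
def unhealthy_targets(payload: dict) -> list:
    """Return job, scrape URL, and last error for every target that is not up."""
    problems = []
    for target in payload["data"]["activeTargets"]:
        if target.get("health") != "up":
            problems.append({
                "job": target.get("labels", {}).get("job", ""),
                "scrapeUrl": target.get("scrapeUrl", ""),
                "lastError": target.get("lastError", ""),
            })
    return problems


if __name__ == "__main__":
    # Hypothetical response body, trimmed to the fields used above.
    sample = {"status": "success", "data": {"activeTargets": [
        {"labels": {"job": "node"}, "scrapeUrl": "http://10.0.0.1:9100/metrics",
         "health": "up", "lastError": ""},
        {"labels": {"job": "api"}, "scrapeUrl": "http://10.0.0.2:8080/metrics",
         "health": "down", "lastError": "connection refused"},
    ]}}
    for problem in unhealthy_targets(sample):
        print(problem["job"], problem["scrapeUrl"], problem["lastError"])
```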
Recovery Procedures
Prometheus OOM
- Check cardinality
- Reduce retention
- Add federation
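For the cardinality check, Prometheus exposes head-block statistics at `/api/v1/status/tsdb`. The sketch below ranks metric names by series count from such a response; the function name `top_series_by_metric` and the sample numbers are assumptions made for illustration, and the payload is trimmed to the two fields the code reads.

```python
def top_series_by_metric(payload: dict, limit: int = 5) -> list:
    """Rank metric names by series count, with each one's share of all head series."""
    data = payload["data"]
    total = data["headStats"]["numSeries"]
    rows = sorted(data["seriesCountByMetricName"],
                  key=lambda row: row["value"], reverse=True)
    return [
        (row["name"], row["value"], row["value"] / total if total else 0.0)
        for row in rows[:limit]
    ]


if __name__ == "__main__":
    # Hypothetical response body, trimmed to the fields used above.
    sample = {"status": "success", "data": {
        "headStats": {"numSeries": 1000},
        "seriesCountByMetricName": [
            {"name": "http_request_duration_seconds_bucket", "value": 600},
            {"name": "http_requests_total", "value": 150},
            {"name": "up", "value": 10},
        ],
    }}
    for name, count, share in top_series_by_metric(sample, limit=2):
        print(f"{name}: {count} series ({share:.0%})")
```

A metric that owns a large share of all series is usually the first candidate for dropping labels before retention or federation changes are considered.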
Resources