| name | monitoring-observability |
| description | Monitoring and observability strategy, implementation, and troubleshooting. Use for designing metrics/logs/traces systems, setting up Prometheus/Grafana/Loki, creating alerts and dashboards, calculating SLOs and error budgets, analyzing performance issues, and comparing monitoring tools (Datadog, ELK, CloudWatch). Covers the Four Golden Signals, RED/USE methods, OpenTelemetry instrumentation, log aggregation patterns, and distributed tracing. |
Monitoring & Observability
Overview
This skill provides comprehensive guidance for monitoring and observability workflows including metrics design, log aggregation, distributed tracing, alerting strategies, SLO/SLA management, and tool selection.
When to use this skill:
- Setting up monitoring for new services
- Designing alerts and dashboards
- Troubleshooting performance issues
- Implementing SLO tracking and error budgets
- Choosing between monitoring tools
- Integrating OpenTelemetry instrumentation
- Analyzing metrics, logs, and traces
- Optimizing Datadog costs and finding waste
- Migrating from Datadog to open-source stack
Core Workflow: Observability Implementation
Use this decision tree to determine your starting point:
Are you setting up monitoring from scratch?
├─ YES → Start with "1. Design Metrics Strategy"
└─ NO → Do you have an existing issue?
├─ YES → Go to "9. Troubleshooting & Analysis"
└─ NO → Are you improving existing monitoring?
├─ Alerts → Go to "3. Alert Design"
├─ Dashboards → Go to "4. Dashboard & Visualization"
├─ SLOs → Go to "5. SLO & Error Budgets"
├─ Tool selection → Read references/tool_comparison.md
└─ Using Datadog? High costs? → Go to "7. Datadog Cost Optimization & Migration"
1. Design Metrics Strategy
Start with The Four Golden Signals
Every service should monitor:
- Latency: Response time (p50, p95, p99)
- Traffic: Requests per second
- Errors: Failure rate
- Saturation: Resource utilization
For request-driven services, use the RED Method:
- Rate: Requests/sec
- Errors: Error rate
- Duration: Response time
For infrastructure resources, use the USE Method:
- Utilization: % time busy
- Saturation: Queue depth
- Errors: Error count
Quick Start - Web Application Example:
# Rate (requests/sec)
sum(rate(http_requests_total[5m]))
# Errors (error rate %)
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100
# Duration (p95 latency)
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
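If you instrument with the Python prometheus_client library, the metrics behind these queries take only a few lines. A minimal sketch (metric and label names mirror the queries above; the simulated handler is illustrative):

```python
# Minimal RED instrumentation sketch with prometheus_client.
# Metric names match the PromQL examples above; labels are illustrative.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter(
    "http_requests_total", "Total HTTP requests", ["method", "path", "status"]
)
LATENCY = Histogram(
    "http_request_duration_seconds", "Request latency in seconds", ["method", "path"]
)

def handle_request(method="GET", path="/orders"):
    with LATENCY.labels(method, path).time():       # feeds the p95 query
        time.sleep(random.uniform(0.01, 0.2))       # simulated work
        status = "500" if random.random() < 0.01 else "200"
    REQUESTS.labels(method, path, status).inc()     # feeds the rate/error queries

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    while True:
        handle_request()
```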
Deep Dive: Metric Design
For comprehensive metric design guidance including:
- Metric types (counter, gauge, histogram, summary)
- Cardinality best practices
- Naming conventions
- Dashboard design principles
→ Read: references/metrics_design.md
Automated Metric Analysis
Detect anomalies and trends in your metrics:
# Analyze Prometheus metrics for anomalies
python3 scripts/analyze_metrics.py prometheus \
--endpoint http://localhost:9090 \
--query 'rate(http_requests_total[5m])' \
--hours 24
# Analyze CloudWatch metrics
python3 scripts/analyze_metrics.py cloudwatch \
--namespace AWS/EC2 \
--metric CPUUtilization \
--dimensions InstanceId=i-1234567890abcdef0 \
--hours 48
→ Script: scripts/analyze_metrics.py
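To see the kind of analysis the script performs, the standard Prometheus HTTP API is all that is needed. A minimal sketch (not the bundled script; the endpoint, query, and 3-sigma threshold are illustrative assumptions):

```python
# Minimal anomaly sketch against the Prometheus HTTP API (/api/v1/query_range).
# Endpoint, query, and the 3-sigma rule are assumptions, not the bundled script.
import statistics
import time

import requests

PROM = "http://localhost:9090"
QUERY = 'sum(rate(http_requests_total[5m]))'

end = time.time()
resp = requests.get(f"{PROM}/api/v1/query_range", params={
    "query": QUERY, "start": end - 24 * 3600, "end": end, "step": "60s",
})
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    values = [float(v) for _, v in series["values"]]
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values) or 1e-9
    anomalies = [(ts, float(v)) for ts, v in series["values"]
                 if abs(float(v) - mean) > 3 * stdev]
    print(series["metric"], f"-> {len(anomalies)} anomalous points")
```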
2. Log Aggregation & Analysis
Structured Logging Checklist
Every log entry should include:
- ✅ Timestamp (ISO 8601 format)
- ✅ Log level (DEBUG, INFO, WARN, ERROR, FATAL)
- ✅ Message (human-readable)
- ✅ Service name
- ✅ Request ID (for tracing)
Example structured log (JSON):
{
  "timestamp": "2024-10-28T14:32:15Z",
  "level": "error",
  "message": "Payment processing failed",
  "service": "payment-service",
  "request_id": "550e8400-e29b-41d4-a716-446655440000",
  "user_id": "user123",
  "order_id": "ORD-456",
  "error_type": "GatewayTimeout",
  "duration_ms": 5000
}
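A minimal way to emit logs in this shape from Python, using only the standard library (service name and field set mirror the example above):

```python
# Structured-logging sketch using only the Python standard library.
# Field names mirror the JSON example above; extra fields ride in the `extra` dict.
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "service": "payment-service",
            "request_id": getattr(record, "request_id", None),
        }
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payment-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("Payment processing failed", extra={
    "request_id": "550e8400-e29b-41d4-a716-446655440000",
    "fields": {"order_id": "ORD-456", "error_type": "GatewayTimeout"},
})
```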
Log Aggregation Patterns
ELK Stack (Elasticsearch, Logstash, Kibana):
- Best for: Deep log analysis, complex queries
- Cost: High (infrastructure + operations)
- Complexity: High
Grafana Loki:
- Best for: Cost-effective logging, Kubernetes
- Cost: Low
- Complexity: Medium
CloudWatch Logs:
- Best for: AWS-centric applications
- Cost: Medium
- Complexity: Low
Log Analysis
Analyze logs for errors, patterns, and anomalies:
# Analyze log file for patterns
python3 scripts/log_analyzer.py application.log
# Show error lines with context
python3 scripts/log_analyzer.py application.log --show-errors
# Extract stack traces
python3 scripts/log_analyzer.py application.log --show-traces
→ Script: scripts/log_analyzer.py
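For a quick pass without the script, a few lines of Python can surface the most frequent error messages, assuming the JSON log shape shown earlier:

```python
# Quick error-pattern count for JSON-formatted logs (log shape assumed as above).
import json
import sys
from collections import Counter

counts = Counter()
with open(sys.argv[1]) as f:
    for line in f:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines
        if entry.get("level") == "error":
            counts[entry.get("message", "<no message>")] += 1

for message, n in counts.most_common(10):
    print(f"{n:6d}  {message}")
```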
Deep Dive: Logging
For comprehensive logging guidance including:
- Structured logging implementation examples (Python, Node.js, Go, Java)
- Log aggregation patterns (ELK, Loki, CloudWatch, Fluentd)
- Query patterns and best practices
- PII redaction and security
- Sampling and rate limiting
→ Read: references/logging_guide.md
3. Alert Design
Alert Design Principles
- Every alert must be actionable - if there is nothing a responder can do about it, don't alert
- Alert on symptoms, not causes - Alert on user experience, not components
- Tie alerts to SLOs - Connect to business impact
- Reduce noise - Only page for critical issues
Alert Severity Levels
| Severity | Response Time | Example |
|---|---|---|
| Critical | Page immediately | Service down, SLO violation |
| Warning | Ticket, review in hours | Elevated error rate, resource warning |
| Info | Log for awareness | Deployment completed, scaling event |
Multi-Window Burn Rate Alerting
Alert when error budget is consumed too quickly:
# Fast burn (1h window) - Critical
- alert: ErrorBudgetFastBurn
  expr: |
    (error_rate / 0.001) > 14.4  # 99.9% SLO
  for: 2m
  labels:
    severity: critical

# Slow burn (6h window) - Warning
- alert: ErrorBudgetSlowBurn
  expr: |
    (error_rate / 0.001) > 6  # 99.9% SLO
  for: 30m
  labels:
    severity: warning
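The 14.4 and 6 thresholds follow the common SRE-workbook convention: page when roughly 2% of a 30-day error budget burns in 1 hour, warn when roughly 5% burns in 6 hours (error_rate above stands in for a recording rule computing your failure ratio). The factors are simple arithmetic:

```python
# Where the 14.4 and 6 burn-rate factors come from (30-day SLO window assumed).
PERIOD_HOURS = 30 * 24  # 720h SLO window

def burn_rate_threshold(budget_fraction_consumed, window_hours):
    # A burn rate of N means the budget is being spent N times faster than the
    # rate that would exactly exhaust it over the full period.
    return budget_fraction_consumed * PERIOD_HOURS / window_hours

print(burn_rate_threshold(0.02, 1))   # 14.4 -> fast burn, critical
print(burn_rate_threshold(0.05, 6))   # 6.0  -> slow burn, warning
```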
Alert Quality Checker
Audit your alert rules against best practices:
# Check single file
python3 scripts/alert_quality_checker.py alerts.yml
# Check all rules in directory
python3 scripts/alert_quality_checker.py /path/to/prometheus/rules/
Checks for:
- Alert naming conventions
- Required labels (severity, team)
- Required annotations (summary, description, runbook_url)
- PromQL expression quality
- 'for' clause to prevent flapping
→ Script: scripts/alert_quality_checker.py
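The same checks are easy to prototype. A minimal sketch (not the bundled script) that flags rules missing a severity label, key annotations, or a 'for' clause:

```python
# Minimal lint pass over a Prometheus rule file (not the bundled script).
# Covers a subset of the checks above: severity label, annotations, 'for' clause.
import sys

import yaml  # pip install pyyaml

def lint(path):
    with open(path) as f:
        doc = yaml.safe_load(f)
    for group in doc.get("groups", []):
        for rule in group.get("rules", []):
            if "alert" not in rule:
                continue  # skip recording rules
            name = rule["alert"]
            if "severity" not in rule.get("labels", {}):
                print(f"{name}: missing severity label")
            annotations = rule.get("annotations", {})
            for key in ("summary", "description", "runbook_url"):
                if key not in annotations:
                    print(f"{name}: missing {key} annotation")
            if "for" not in rule:
                print(f"{name}: no 'for' clause (alert may flap)")

if __name__ == "__main__":
    lint(sys.argv[1])
```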
Alert Templates
Production-ready alert rule templates:
→ Templates:
- assets/templates/prometheus-alerts/webapp-alerts.yml - Web application alerts
- assets/templates/prometheus-alerts/kubernetes-alerts.yml - Kubernetes alerts
Deep Dive: Alerting
For comprehensive alerting guidance including:
- Alert design patterns (multi-window, rate of change, threshold with hysteresis)
- Alert annotation best practices
- Alert routing (severity-based, team-based, time-based)
- Inhibition rules
- Runbook structure
- On-call best practices
→ Read: references/alerting_best_practices.md
Runbook Template
Create comprehensive runbooks for your alerts:
→ Template: assets/templates/runbooks/incident-runbook-template.md
4. Dashboard & Visualization
Dashboard Design Principles
- Top-down layout: Most important metrics first
- Color coding: Red (critical), yellow (warning), green (healthy)
- Consistent time windows: All panels use same time range
- Limit panels: 8-12 panels per dashboard maximum
- Include context: Show related metrics together
Recommended Dashboard Structure
┌─────────────────────────────────────┐
│ Overall Health (Single Stats) │
│ [Requests/s] [Error%] [P95 Latency]│
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ Request Rate & Errors (Graphs) │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ Latency Distribution (Graphs) │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ Resource Usage (Graphs) │
└─────────────────────────────────────┘
Generate Grafana Dashboards
Automatically generate dashboards from templates:
# Web application dashboard
python3 scripts/dashboard_generator.py webapp \
--title "My API Dashboard" \
--service my_api \
--output dashboard.json
# Kubernetes dashboard
python3 scripts/dashboard_generator.py kubernetes \
--title "K8s Production" \
--namespace production \
--output k8s-dashboard.json
# Database dashboard
python3 scripts/dashboard_generator.py database \
--title "PostgreSQL" \
--db-type postgres \
--instance db.example.com:5432 \
--output db-dashboard.json
Supports:
- Web applications (requests, errors, latency, resources)
- Kubernetes (pods, nodes, resources, network)
- Databases (PostgreSQL, MySQL)
→ Script: scripts/dashboard_generator.py
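Generated dashboards are plain Grafana JSON; the essential structure looks roughly like this (a hand-trimmed sketch, not the generator's exact output; panel fields and layout vary by Grafana version):

```python
# Rough shape of a Grafana dashboard built programmatically (illustrative only).
import json

dashboard = {
    "title": "My API Dashboard",
    "time": {"from": "now-6h", "to": "now"},
    "panels": [
        {
            "type": "timeseries",
            "title": "Request rate",
            "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
            "targets": [
                {"expr": 'sum(rate(http_requests_total{service="my_api"}[5m]))',
                 "refId": "A"}
            ],
        }
    ],
}

with open("dashboard.json", "w") as f:
    # When importing via the Grafana HTTP API, this dict is typically wrapped
    # as {"dashboard": dashboard, "overwrite": true}.
    json.dump(dashboard, f, indent=2)
```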
5. SLO & Error Budgets
SLO Fundamentals
SLI (Service Level Indicator): Measurement of service quality
- Example: Request latency, error rate, availability
SLO (Service Level Objective): Target value for an SLI
- Example: "99.9% of requests return in < 500ms"
Error Budget: Allowed failure amount = (100% - SLO)
- Example: 99.9% SLO = 0.1% error budget = 43.2 minutes/month
Common SLO Targets
| Availability | Downtime/Month | Use Case |
|---|---|---|
| 99% | 7.2 hours | Internal tools |
| 99.9% | 43.2 minutes | Standard production |
| 99.95% | 21.6 minutes | Critical services |
| 99.99% | 4.3 minutes | High availability |
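The downtime figures above are just the error budget applied to a 30-day window:

```python
# How the downtime-per-month numbers are derived (30-day month assumed).
def downtime_minutes_per_month(slo_percent, days=30):
    error_budget = 1 - slo_percent / 100
    return days * 24 * 60 * error_budget

for slo in (99.0, 99.9, 99.95, 99.99):
    print(f"{slo}% -> {downtime_minutes_per_month(slo):.1f} minutes/month")
# 99.0% -> 432.0, 99.9% -> 43.2, 99.95% -> 21.6, 99.99% -> 4.3
```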
SLO Calculator
Calculate compliance, error budgets, and burn rates:
# Show SLO reference table
python3 scripts/slo_calculator.py --table
# Calculate availability SLO
python3 scripts/slo_calculator.py availability \
--slo 99.9 \
--total-requests 1000000 \
--failed-requests 1500 \
--period-days 30
# Calculate burn rate
python3 scripts/slo_calculator.py burn-rate \
--slo 99.9 \
--errors 50 \
--requests 10000 \
--window-hours 1
→ Script: scripts/slo_calculator.py
Deep Dive: SLO/SLA
For comprehensive SLO/SLA guidance including:
- Choosing appropriate SLIs
- Setting realistic SLO targets
- Error budget policies
- Burn rate alerting
- SLA structure and contracts
- Monthly reporting templates
→ Read: references/slo_sla_guide.md
6. Distributed Tracing
When to Use Tracing
Use distributed tracing when you need to:
- Debug performance issues across services
- Understand request flow through microservices
- Identify bottlenecks in distributed systems
- Find N+1 query problems
OpenTelemetry Implementation
Python example:
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

@tracer.start_as_current_span("process_order")
def process_order(order_id):
    span = trace.get_current_span()
    span.set_attribute("order.id", order_id)
    try:
        result = payment_service.charge(order_id)
        span.set_attribute("payment.status", "success")
        return result
    except Exception as e:
        span.set_status(trace.Status(trace.StatusCode.ERROR))
        span.record_exception(e)
        raise
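For those spans to be exported anywhere, a tracer provider and exporter must be configured at startup. A minimal sketch using the OTLP gRPC exporter (the collector endpoint and service name are assumptions; match them to your OTel Collector):

```python
# Minimal tracer setup so the spans above are actually exported.
# Endpoint and service name are assumptions; adjust to your environment.
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider(
    resource=Resource.create({"service.name": "order-service"})
)
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)
    )
)
trace.set_tracer_provider(provider)
```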
Sampling Strategies
- Development: 100% (ALWAYS_ON)
- Staging: 50-100%
- Production: 1-10% (or error-based sampling)
Error-based sampling (always sample errors, 1% of successes):
from opentelemetry.sdk.trace.sampling import Decision, Sampler, SamplingResult

class ErrorSampler(Sampler):
    def should_sample(self, parent_context, trace_id, name, kind=None,
                      attributes=None, links=None, trace_state=None):
        attributes = attributes or {}
        if attributes.get("error", False):
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)
        if trace_id & 0xFF < 3:  # ~1% of the remaining traces
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)
        return SamplingResult(Decision.DROP)

    def get_description(self):
        return "ErrorSampler"
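Wiring the sampler in is a one-liner on the SDK TracerProvider (a sketch, building on the setup shown above):

```python
# Register the custom sampler when constructing the provider.
from opentelemetry.sdk.trace import TracerProvider

provider = TracerProvider(sampler=ErrorSampler())
```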
OTel Collector Configuration
Production-ready OpenTelemetry Collector configuration:
→ Template: assets/templates/otel-config/collector-config.yaml
Features:
- Receives OTLP, Prometheus, and host metrics
- Batching and memory limiting
- Tail sampling (error-based, latency-based, probabilistic)
- Multiple exporters (Tempo, Jaeger, Loki, Prometheus, CloudWatch, Datadog)
Deep Dive: Tracing
For comprehensive tracing guidance including:
- OpenTelemetry instrumentation (Python, Node.js, Go, Java)
- Span attributes and semantic conventions
- Context propagation (W3C Trace Context)
- Backend comparison (Jaeger, Tempo, X-Ray, Datadog APM)
- Analysis patterns (finding slow traces, N+1 queries)
- Integration with logs
→ Read: references/tracing_guide.md
7. Datadog Cost Optimization & Migration
Scenario 1: I'm Using Datadog and Costs Are Too High
If your Datadog bill is growing out of control, start by identifying waste:
Cost Analysis Script
Automatically analyze your Datadog usage and find cost optimization opportunities:
# Analyze Datadog usage (requires API key and APP key)
python3 scripts/datadog_cost_analyzer.py \
--api-key $DD_API_KEY \
--app-key $DD_APP_KEY
# Show detailed breakdown by category
python3 scripts/datadog_cost_analyzer.py \
--api-key $DD_API_KEY \
--app-key $DD_APP_KEY \
--show-details
What it checks:
- Infrastructure host count and cost
- Custom metrics usage and high-cardinality metrics
- Log ingestion volume and trends
- APM host usage
- Unused or noisy monitors
- Container vs VM optimization opportunities
→ Script: scripts/datadog_cost_analyzer.py
Common Cost Optimization Strategies
1. Custom Metrics Optimization (typical savings: 20-40%):
- Remove high-cardinality tags (user IDs, request IDs)
- Delete unused custom metrics
- Aggregate metrics before sending
- Use metric prefixes to identify teams/services
2. Log Management (typical savings: 30-50%):
- Implement log sampling for high-volume services
- Use exclusion filters for debug/trace logs in production
- Archive cold logs to S3/GCS after 7 days
- Set log retention policies (15 days instead of 30)
3. APM Optimization (typical savings: 15-25%):
- Reduce trace sampling rates (10% → 5% in prod)
- Use head-based sampling instead of ingesting every trace
- Remove APM from non-critical services
- Use trace search with lower retention
4. Infrastructure Monitoring (typical savings: 10-20%):
- Switch from VM-based to container-based pricing where possible
- Remove agents from ephemeral instances
- Use Datadog's host reduction strategies
- Consolidate staging environments
Scenario 2: Migrating Away from Datadog
If you're considering migrating to a more cost-effective open-source stack:
Migration Overview
From Datadog → To Open Source Stack:
- Metrics: Datadog → Prometheus + Grafana
- Logs: Datadog Logs → Grafana Loki
- Traces: Datadog APM → Tempo or Jaeger
- Dashboards: Datadog → Grafana
- Alerts: Datadog Monitors → Prometheus Alertmanager
Estimated Cost Savings: 60-77% ($49.8k-61.8k/year for 100-host environment)
Migration Strategy
Phase 1: Run Parallel (Month 1-2):
- Deploy open-source stack alongside Datadog
- Migrate metrics first (lowest risk)
- Validate data accuracy
Phase 2: Migrate Dashboards & Alerts (Month 2-3):
- Convert Datadog dashboards to Grafana
- Translate alert rules (use DQL → PromQL guide below)
- Train team on new tools
Phase 3: Migrate Logs & Traces (Month 3-4):
- Set up Loki for log aggregation
- Deploy Tempo/Jaeger for tracing
- Update application instrumentation
Phase 4: Decommission Datadog (Month 4-5):
- Confirm all functionality migrated
- Cancel Datadog subscription
Query Translation: DQL → PromQL
When migrating dashboards and alerts, you'll need to translate Datadog queries to PromQL:
Quick examples:
# Average CPU
Datadog: avg:system.cpu.user{*}
Prometheus: avg(node_cpu_seconds_total{mode="user"})
# Request rate
Datadog: sum:requests.count{*}.as_rate()
Prometheus: sum(rate(http_requests_total[5m]))
# P95 latency
Datadog: p95:request.duration{*}
Prometheus: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# Error rate percentage
Datadog: (sum:requests.errors{*}.as_rate() / sum:requests.count{*}.as_rate()) * 100
Prometheus: (sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100
→ Full Translation Guide: references/dql_promql_translation.md
Cost Comparison
Example: 100-host infrastructure
| Component | Datadog (Annual) | Open Source (Annual) | Savings |
|---|---|---|---|
| Infrastructure | $18,000 | $10,000 (self-hosted infra) | $8,000 |
| Custom Metrics | $600 | Included | $600 |
| Logs | $24,000 | $3,000 (storage) | $21,000 |
| APM/Traces | $37,200 | $5,000 (storage) | $32,200 |
| Total | $79,800 | $18,000 | $61,800 (77%) |
Deep Dive: Datadog Migration
For comprehensive migration guidance including:
- Detailed cost comparison and ROI calculations
- Step-by-step migration instructions
- Infrastructure sizing recommendations (CPU, RAM, storage)
- Dashboard conversion tools and examples
- Alert rule translation patterns
- Application instrumentation changes (DogStatsD → Prometheus client)
- Python scripts for exporting Datadog dashboards and monitors
- Common challenges and solutions
→ Read: references/datadog_migration.md
8. Tool Selection & Comparison
Decision Matrix
Choose Prometheus + Grafana if:
- ✅ Using Kubernetes
- ✅ Want control and customization
- ✅ Have ops capacity
- ✅ Budget-conscious
Choose Datadog if:
- ✅ Want ease of use
- ✅ Need full observability now
- ✅ Budget allows ($8k+/month for 100 hosts)
Choose Grafana Stack (LGTM) if:
- ✅ Want open source full stack
- ✅ Cost-effective solution
- ✅ Cloud-native architecture
Choose ELK Stack if:
- ✅ Heavy log analysis needs
- ✅ Need powerful search
- ✅ Have dedicated ops team
Choose Cloud Native (CloudWatch/etc) if:
- ✅ Single cloud provider
- ✅ Simple needs
- ✅ Want minimal setup
Cost Comparison (100 hosts, 1TB logs/month)
| Solution | Monthly Cost | Setup | Ops Burden |
|---|---|---|---|
| Prometheus + Loki + Tempo | $1,500 | Medium | Medium |
| Grafana Cloud | $3,000 | Low | Low |
| Datadog | $8,000 | Low | None |
| ELK Stack | $4,000 | High | High |
| CloudWatch | $2,000 | Low | Low |
Deep Dive: Tool Comparison
For comprehensive tool comparison including:
- Metrics platforms (Prometheus, Datadog, New Relic, CloudWatch, Grafana Cloud)
- Logging platforms (ELK, Loki, Splunk, CloudWatch Logs, Sumo Logic)
- Tracing platforms (Jaeger, Tempo, Datadog APM, X-Ray)
- Full-stack observability comparison
- Recommendations by company size
→ Read: references/tool_comparison.md
9. Troubleshooting & Analysis
Health Check Validation
Validate health check endpoints against best practices:
# Check single endpoint
python3 scripts/health_check_validator.py https://api.example.com/health
# Check multiple endpoints
python3 scripts/health_check_validator.py \
https://api.example.com/health \
https://api.example.com/readiness \
--verbose
Checks for:
- ✓ Returns 200 status code
- ✓ Response time < 1 second
- ✓ Returns JSON format
- ✓ Contains 'status' field
- ✓ Includes version/build info
- ✓ Checks dependencies
- ✓ Disables caching
→ Script: scripts/health_check_validator.py
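A health endpoint that passes these checks can be very small. A sketch using Flask (the framework and dependency probes are assumptions; adapt to your stack):

```python
# Sketch of a /health endpoint satisfying the checks above (Flask assumed).
from flask import Flask, jsonify

app = Flask(__name__)
VERSION = "1.4.2"  # illustrative build info

def check_database():
    # Replace with a real connectivity probe, e.g. SELECT 1 with a short timeout.
    return True

@app.route("/health")
def health():
    deps = {"database": "ok" if check_database() else "failing"}
    healthy = all(v == "ok" for v in deps.values())
    resp = jsonify({
        "status": "ok" if healthy else "degraded",
        "version": VERSION,
        "dependencies": deps,
    })
    resp.headers["Cache-Control"] = "no-store"  # disable caching
    return resp, 200 if healthy else 503

if __name__ == "__main__":
    app.run(port=8080)
```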
Common Troubleshooting Workflows
High Latency Investigation:
- Check dashboards for latency spike
- Query traces for slow operations
- Check database slow query log
- Check external API response times
- Review recent deployments
- Check resource utilization (CPU, memory)
High Error Rate Investigation:
- Check error logs for patterns
- Identify affected endpoints
- Check dependency health
- Review recent deployments
- Check resource limits
- Verify configuration
Service Down Investigation:
- Check if pods/instances are running
- Check health check endpoint
- Review recent deployments
- Check resource availability
- Check network connectivity
- Review logs for startup errors
Quick Reference Commands
Prometheus Queries
# Request rate
sum(rate(http_requests_total[5m]))
# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100
# P95 latency
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
# CPU usage
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
Kubernetes Commands
# Check pod status
kubectl get pods -n <namespace>
# View pod logs
kubectl logs -f <pod-name> -n <namespace>
# Check pod resources
kubectl top pods -n <namespace>
# Describe pod for events
kubectl describe pod <pod-name> -n <namespace>
# Check recent deployments
kubectl rollout history deployment/<name> -n <namespace>
Log Queries
Elasticsearch:
GET /logs-*/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "level": "error" } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  }
}
Loki (LogQL):
{job="app", level="error"} |= "error" | json
CloudWatch Insights:
fields @timestamp, level, message
| filter level = "error"
| filter @timestamp > ago(1h)
Resources Summary
Scripts (automation and analysis)
- analyze_metrics.py - Detect anomalies in Prometheus/CloudWatch metrics
- alert_quality_checker.py - Audit alert rules against best practices
- slo_calculator.py - Calculate SLO compliance and error budgets
- log_analyzer.py - Parse logs for errors and patterns
- dashboard_generator.py - Generate Grafana dashboards from templates
- health_check_validator.py - Validate health check endpoints
- datadog_cost_analyzer.py - Analyze Datadog usage and find cost waste
References (deep-dive documentation)
- metrics_design.md - Four Golden Signals, RED/USE methods, metric types
- alerting_best_practices.md - Alert design, runbooks, on-call practices
- logging_guide.md - Structured logging, aggregation patterns
- tracing_guide.md - OpenTelemetry, distributed tracing
- slo_sla_guide.md - SLI/SLO/SLA definitions, error budgets
- tool_comparison.md - Comprehensive comparison of monitoring tools
- datadog_migration.md - Complete guide for migrating from Datadog to an OSS stack
- dql_promql_translation.md - Datadog Query Language to PromQL translation reference
Templates (ready-to-use configurations)
- prometheus-alerts/webapp-alerts.yml - Production-ready web app alerts
- prometheus-alerts/kubernetes-alerts.yml - Kubernetes monitoring alerts
- otel-config/collector-config.yaml - OpenTelemetry Collector configuration
- runbooks/incident-runbook-template.md - Incident response template
Best Practices
Metrics
- Start with Four Golden Signals
- Use appropriate metric types (counter, gauge, histogram)
- Keep cardinality low (avoid high-cardinality labels)
- Follow naming conventions
Logging
- Use structured logging (JSON)
- Include request IDs for tracing
- Set appropriate log levels
- Redact PII before logging
Alerting
- Make every alert actionable
- Alert on symptoms, not causes
- Use multi-window burn rate alerts
- Include runbook links
Tracing
- Sample appropriately (1-10% in production)
- Always record errors
- Use semantic conventions
- Propagate context between services
SLOs
- Start with current performance
- Set realistic targets
- Define error budget policies
- Review and adjust quarterly