---
name: observability-and-monitoring
description: Use when implementing metrics/logs/traces, defining SLIs/SLOs, designing alerts, choosing observability tools, debugging alert fatigue, or optimizing observability costs - provides SRE frameworks, anti-patterns, and implementation patterns
---
Observability and Monitoring
Overview
Core principle: Measure what users care about, alert on symptoms not causes, make alerts actionable.
Rule: Observability without actionability is just expensive logging.
Already have observability tools (CloudWatch, Datadog, etc.)? Optimize what you have first. Most observability problems are usage/process issues, not tooling. Implement SLIs/SLOs, clean up alerts, add runbooks with existing tools. Migrate only if you hit concrete tool limitations (cost, features, multi-cloud). Tool migration is expensive - make sure it solves a real problem.
Getting Started Decision Tree
| Team Size | Scale | Starting Point | Tools |
|---|---|---|---|
| 1-5 engineers | <10 services | Metrics + logs | Prometheus + Grafana + Loki |
| 5-20 engineers | 10-50 services | Metrics + logs + basic traces | Add Jaeger, OpenTelemetry |
| 20+ engineers | 50+ services | Full observability + SLOs | Managed platform (Datadog, Grafana Cloud) |
First step: Implement metrics with OpenTelemetry + Prometheus
Why this order: metrics give the fastest time-to-value (detecting issues), logs help you debug (understanding what happened), and traces solve complex distributed problems (debugging cross-service issues)
Three Pillars Quick Reference
Metrics (Quantitative, aggregated)
When to use: Alerting, dashboards, trend analysis
What to collect:
- RED method (services): Rate, Errors, Duration
- USE method (resources): Utilization, Saturation, Errors
- Four Golden Signals: Latency, traffic, errors, saturation
Implementation:
```python
# OpenTelemetry metrics
from opentelemetry import metrics

meter = metrics.get_meter(__name__)

request_counter = meter.create_counter(
    "http_requests_total",
    description="Total HTTP requests",
)
request_duration = meter.create_histogram(
    "http_request_duration_seconds",
    unit="s",
    description="HTTP request duration",
)

# Instrument a request (duration is measured by the caller, in seconds)
request_counter.add(1, {"method": "GET", "endpoint": "/api/users"})
request_duration.record(duration, {"method": "GET", "endpoint": "/api/users"})
```
Logs (Discrete events)
When to use: Debugging, audit trails, error investigation
Best practices:
- Structured logging (JSON)
- Include correlation IDs
- Don't log sensitive data (PII, secrets)
Implementation:
```python
import structlog

log = structlog.get_logger()

# Field values (user_id, correlation_id, ...) come from the request context
log.info(
    "user_login",
    user_id=user_id,
    correlation_id=correlation_id,
    ip_address=ip,
    duration_ms=duration,
)
```
Traces (Request flows)
When to use: Debugging distributed systems, latency investigation
Implementation:
```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order.id", order_id)
    span.set_attribute("user.id", user_id)
    # Process order logic
```
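Traces only pay off across service boundaries if context is propagated on outbound calls. A minimal sketch using OpenTelemetry's propagation API; the downstream URL and the use of `requests` are illustrative:

```python
import requests
from opentelemetry.propagate import inject

def call_inventory(order_id):
    headers = {}
    inject(headers)  # writes the W3C traceparent header from the current span
    # inventory.internal is a placeholder for your downstream service
    return requests.get(f"http://inventory.internal/orders/{order_id}", headers=headers)
```

Auto-instrumentation for common HTTP clients does this for you; the manual version is shown to make the mechanism visible.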
Anti-Patterns Catalog
❌ Vanity Metrics
Symptom: Tracking metrics that look impressive but don't inform decisions
Why bad: Wastes resources, distracts from actionable metrics
Fix: Only collect metrics that answer "should I page someone?" or inform business decisions
```python
# ❌ Bad - vanity metric; never informs a decision
total_requests_all_time_counter.inc()

# ✅ Good - actionable: count errors per service/endpoint,
# compute the error *rate* at query time
request_errors_total.labels(service="api", endpoint="/users").inc()
```
❌ Alert on Everything
Symptom: Hundreds of alerts per day, team ignores most of them
Why bad: Alert fatigue, real issues get missed, on-call burnout
Fix: Alert only on user-impacting symptoms that require immediate action
Test: "If this alert fires at 2am, should someone wake up to fix it?" If no, it's not an alert.
❌ No Runbooks
Symptom: Alerts fire with no guidance on how to respond
Why bad: Increased MTTR, inconsistent responses, on-call stress
Fix: Every alert must link to a runbook with investigation steps
```yaml
# ✅ Good - alert with a linked runbook
alert: HighErrorRate
annotations:
  summary: "Error rate >5% on {{ $labels.service }}"
  description: "Current: {{ $value }}%"
  runbook: "https://wiki.company.com/runbooks/high-error-rate"
```
❌ Cardinality Explosion
Symptom: Metrics with unbounded labels (user IDs, timestamps, UUIDs) cause storage/performance issues
Why bad: Expensive storage, slow queries, potential system failure
Fix: Use fixed cardinality labels, aggregate high-cardinality dimensions
```python
# ❌ Bad - unbounded cardinality
request_counter.labels(user_id=user_id).inc()  # Millions of unique series

# ✅ Good - bounded cardinality
request_counter.labels(user_type="premium", region="us-east").inc()
```
❌ Missing Correlation IDs
Symptom: Can't trace requests across services, debugging takes hours
Why bad: High MTTR, frustrated engineers, customer impact
Fix: Generate correlation ID at entry point, propagate through all services
```python
# ✅ Good - correlation ID propagation
# (request and log come from your web framework and logger setup)
import uuid
from contextvars import ContextVar

correlation_id_var = ContextVar("correlation_id", default=None)

def handle_request():
    # Reuse the caller's ID if present; otherwise mint a new one
    correlation_id = request.headers.get("X-Correlation-ID") or str(uuid.uuid4())
    correlation_id_var.set(correlation_id)
    # All logs and traces include it from here on
    log.info("processing_request", correlation_id=correlation_id)
```
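If you log with structlog, its contextvars helpers can attach the ID to every subsequent log line automatically instead of passing it by hand. A sketch building on the handler above:

```python
import structlog

structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,  # folds bound context into each event
        structlog.processors.JSONRenderer(),
    ]
)

# Inside handle_request(), after resolving correlation_id:
structlog.contextvars.bind_contextvars(correlation_id=correlation_id)
log.info("processing_request")  # correlation_id is included automatically
```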
SLI Selection Framework
Principle: Measure user experience, not system internals
Four Golden Signals
| Signal | Definition | Example SLI |
|---|---|---|
| Latency | Request response time | p99 latency < 200ms |
| Traffic | Demand on system | Requests per second |
| Errors | Failed requests | Error rate < 0.1% |
| Saturation | Resource fullness | CPU < 80%, queue depth < 100 |
RED Method (for services)
- Rate: Requests per second
- Errors: Error rate (%)
- Duration: Response time (p50, p95, p99)
USE Method (for resources)
- Utilization: % time resource busy (CPU %, disk I/O %)
- Saturation: Queue depth, wait time
- Errors: Error count
Decision framework:
| Service Type | Recommended SLIs |
|---|---|
| User-facing API | Availability (%), p95 latency, error rate |
| Background jobs | Freshness (time since last run), success rate, processing time |
| Data pipeline | Data freshness, completeness (%), processing latency |
| Storage | Availability, durability, latency percentiles |
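For background jobs (second row above), freshness is usually the easiest SLI to start with. One common pattern, sketched here with an illustrative metric name and a placeholder `do_work()`, is a "last success" gauge:

```python
import time
from prometheus_client import Gauge

last_success = Gauge(
    "job_last_success_timestamp_seconds",
    "Unix time of the last successful job run",
)

def run_job():
    do_work()  # placeholder for the actual job logic
    last_success.set(time.time())

# Freshness SLI: alert when time() - job_last_success_timestamp_seconds
# exceeds your freshness threshold
```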
SLO Definition Guide
SLO = Target value for SLI
Formula: SLO = (good events / total events) >= target
Example:
SLI: Request success rate
SLO: 99.9% of requests succeed (measured over 30 days)
Error budget: 0.1% = ~43 minutes downtime/month
Error Budget
Definition: Amount of unreliability you can tolerate
Calculation:
Error budget = 1 - SLO target
If SLO = 99.9%, error budget = 0.1%
For 1M requests/month: 1,000 requests can fail
Usage: Balance reliability vs feature velocity
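The arithmetic behind those numbers, as a quick sketch:

```python
slo_target = 0.999
window_minutes = 30 * 24 * 60            # 43,200 minutes in a 30-day window

error_budget = 1 - slo_target            # 0.001 (0.1%)
downtime_budget = window_minutes * error_budget   # ~43.2 minutes/month

requests_per_month = 1_000_000
allowed_failures = requests_per_month * error_budget  # 1,000 failed requests
```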
Multi-Window Multi-Burn-Rate Alerting
Problem: Simple threshold alerts are either too noisy or too slow
Solution: Alert based on how fast you're burning error budget
```yaml
# Page if the budget is burning 14.4x faster than sustainable
# (i.e. 2% of a 30-day budget consumed in 1 hour), checked on both
# a long and a short window so the alert also clears quickly
alert: ErrorBudgetBurnRateHigh
expr: |
  (sum(rate(errors[1h])) / sum(rate(requests[1h]))) > (14.4 * (1 - 0.999))
  and
  (sum(rate(errors[5m])) / sum(rate(requests[5m]))) > (14.4 * (1 - 0.999))
annotations:
  summary: "Burning error budget at 14.4x rate"
  runbook: "https://wiki/runbooks/error-budget-burn"
```
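Where 14.4 comes from: a burn rate is budget consumption scaled from the alert window to the SLO window. For the common page-worthy case of 2% of a 30-day budget burned in one hour:

```python
slo_window_hours = 30 * 24      # 720-hour (30-day) SLO window
budget_fraction = 0.02          # page when 2% of the budget burns...
alert_window_hours = 1          # ...within one hour

burn_rate = budget_fraction * slo_window_hours / alert_window_hours  # 14.4
```

The same formula yields slower-burn warning tiers (e.g. burn rate 6 for 5% in 6 hours).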
Alert Design Patterns
Principle: Alert on symptoms (user impact) not causes (CPU high)
Symptom-Based Alerting
```yaml
# ❌ Bad - alert on a cause
alert: HighCPU
expr: cpu_usage > 0.80

# ✅ Good - alert on a symptom
alert: HighLatency
expr: http_request_duration_p99 > 1.0  # seconds
```
Alert Severity Levels
| Level | When | Response Time | Example |
|---|---|---|---|
| Critical | User-impacting | Immediate (page) | Error rate >5%, service down |
| Warning | Will become critical | Next business day | Error rate >1%, disk 85% full |
| Info | Informational | No action needed | Deploy completed, scaling event |
Rule: Only page for critical. Everything else goes to dashboard/Slack.
Cost Optimization Quick Reference
Observability can cost 5-15% of infrastructure spend. Optimize:
Sampling Strategies
```python
# Trace sampling - keep 10% of traces
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

sampler = TraceIdRatioBased(0.1)  # sample 10% of trace IDs
provider = TracerProvider(sampler=sampler)
```
When to sample:
- Traces: 1-10% for high-traffic services
- Logs: Sample debug/info logs, keep all errors (see the sketch after this list)
- Metrics: Don't sample (they're already aggregated)
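Log sampling can be done in-process. A hypothetical structlog processor that keeps all warnings and errors but passes only ~10% of debug/info events (the rate and the approach are one option among several):

```python
import random
import structlog

def sample_low_severity(logger, method_name, event_dict):
    # Keep every warning/error; keep roughly 10% of debug/info
    if method_name in ("debug", "info") and random.random() >= 0.1:
        raise structlog.DropEvent
    return event_dict

structlog.configure(
    processors=[sample_low_severity, structlog.processors.JSONRenderer()]
)
```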
Retention Policies
| Data Type | Recommended Retention | Rationale |
|---|---|---|
| Metrics | 15 days (raw), 13 months (aggregated) | Trend analysis |
| Logs | 7-30 days | Debugging, compliance |
| Traces | 7 days | Debugging recent issues |
Cardinality Control
```python
# ❌ Bad - high cardinality
http_requests.labels(
    method=method,
    url=full_url,      # Unbounded!
    user_id=user_id,   # Unbounded!
).inc()

# ✅ Good - controlled cardinality
http_requests.labels(
    method=method,
    endpoint=route_pattern,  # /users/:id, not /users/12345
    status_code=status,
).inc()
```
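If your framework exposes the matched route template (Flask, FastAPI, and most routers do), label with that directly. Otherwise, a normalization helper like this hypothetical one keeps label values bounded:

```python
import re

def route_pattern(path: str) -> str:
    # Fallback normalization; prefer the framework's route template
    path = re.sub(r"/\d+", "/:id", path)  # numeric IDs
    path = re.sub(
        r"/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
        "/:uuid",
        path,
    )
    return path

route_pattern("/users/12345")  # -> "/users/:id"
```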
Tool Ecosystem Quick Reference
| Category | Open Source | Managed/Commercial |
|---|---|---|
| Metrics | Prometheus, VictoriaMetrics | Datadog, New Relic, Grafana Cloud |
| Logs | Loki, ELK Stack | Datadog, Splunk, Sumo Logic |
| Traces | Jaeger, Zipkin | Datadog, Honeycomb, Lightstep |
| All-in-One | Grafana + Loki + Tempo + Mimir | Datadog, New Relic, Dynatrace |
| Instrumentation | OpenTelemetry | (vendor SDKs) |
Recommendation:
- Starting out: Prometheus + Grafana + OpenTelemetry
- Growing (10-50 services): Add Loki (logs) + Jaeger (traces)
- Scale (50+ services): Consider managed (Datadog, Grafana Cloud)
Why OpenTelemetry: Vendor-neutral, future-proof, single instrumentation for all signals
Your First Observability Setup
Goal: Metrics + alerting in one week
Day 1-2: Instrument application
```python
# Add OpenTelemetry metrics with a Prometheus exporter
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.prometheus import PrometheusMetricReader
from prometheus_client import start_http_server

# Initialize and expose /metrics for Prometheus to scrape
meter_provider = MeterProvider(metric_readers=[PrometheusMetricReader()])
metrics.set_meter_provider(meter_provider)
start_http_server(8000)  # port is arbitrary; match your scrape config

# Instrument the HTTP framework (auto-instrumentation);
# `app` is your Flask application
from opentelemetry.instrumentation.flask import FlaskInstrumentor
FlaskInstrumentor().instrument_app(app)
```
Day 3-4: Deploy Prometheus + Grafana
```yaml
# docker-compose.yml
version: '3'
services:
  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
```
Day 5: Define SLIs and SLOs
SLI: HTTP request success rate
SLO: 99.9% of requests succeed (30-day window)
Error budget: 0.1% = 43 minutes downtime/month
Day 6: Create alerts
```yaml
# prometheus-alerts.yml
groups:
  - name: slo_alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            / sum(rate(http_requests_total[5m])) by (service) > 0.05
        for: 5m
        annotations:
          summary: "Error rate >5% on {{ $labels.service }}"
          runbook: "https://wiki/runbooks/high-error-rate"
```
Day 7: Build dashboard
Panels to include (example queries after this list):
- Error rate (%)
- Request rate (req/s)
- p50/p95/p99 latency
- CPU/memory utilization
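Example PromQL for those panels, kept as Python strings so they can be templated into dashboard JSON; the metric names assume the Day 1-2 instrumentation:

```python
PANEL_QUERIES = {
    "error_rate": (
        'sum(rate(http_requests_total{status=~"5.."}[5m]))'
        " / sum(rate(http_requests_total[5m]))"
    ),
    "request_rate": "sum(rate(http_requests_total[5m]))",
    "p99_latency": (
        "histogram_quantile(0.99, "
        "sum(rate(http_request_duration_seconds_bucket[5m])) by (le))"
    ),
}
```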
Common Mistakes
❌ Logging in Production == Debugging in Production
Fix: Use structured logging with correlation IDs, not print statements
❌ Alerting on Predictions, Not Reality
Fix: Alert on actual user impact (errors, latency) not predicted issues (disk 70% full)
❌ Dashboard Sprawl
Fix: One main dashboard per service showing SLIs. Delete dashboards unused for 3 months.
❌ Ignoring Alert Feedback Loop
Fix: Track alert precision (% that led to action). Delete alerts with <50% precision.
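A worked example of that threshold, with invented numbers:

```python
alerts_fired = 40        # pages over the review period
led_to_action = 12       # fires where on-call actually changed something

precision = led_to_action / alerts_fired  # 0.30 -> below 0.5: retune or delete
```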
Quick Reference
Getting Started:
- Start with metrics (Prometheus + OpenTelemetry)
- Add logs when debugging is hard (Loki)
- Add traces when issues span services (Jaeger)
SLI Selection:
- User-facing: Availability, latency, error rate
- Background: Freshness, success rate, processing time
SLO Targets:
- Start with 99% (achievable)
- Increase to 99.9% only if business requires it
- 99.99% is very expensive (4 nines = 52 min/year downtime)
Alerting:
- Critical only = page
- Warning = next business day
- Info = dashboard only
Cost Control:
- Sample traces (1-10%)
- Control metric cardinality (no unbounded labels)
- Set retention policies (7-30 days logs, 15 days metrics)
Tools:
- Small: Prometheus + Grafana + Loki
- Medium: Add Jaeger
- Large: Consider Datadog, Grafana Cloud
Bottom Line
Start with metrics using OpenTelemetry + Prometheus. Define 3-5 SLIs based on user experience. Alert only on symptoms that require immediate action. Add logs and traces when metrics aren't enough.
Measure what users care about, not what's easy to measure.