---
name: observability-and-monitoring
description: Use when implementing metrics/logs/traces, defining SLIs/SLOs, designing alerts, choosing observability tools, debugging alert fatigue, or optimizing observability costs - provides SRE frameworks, anti-patterns, and implementation patterns
---
Observability and Monitoring
Overview
Core principle: Measure what users care about, alert on symptoms not causes, make alerts actionable.
Rule: Observability without actionability is just expensive logging.
Already have observability tools (CloudWatch, Datadog, etc.)? Optimize what you have first. Most observability problems are usage/process issues, not tooling. Implement SLIs/SLOs, clean up alerts, add runbooks with existing tools. Migrate only if you hit concrete tool limitations (cost, features, multi-cloud). Tool migration is expensive - make sure it solves a real problem.
Getting Started Decision Tree
| Team Size | Scale | Starting Point | Tools |
|---|---|---|---|
| 1-5 engineers | <10 services | Metrics + logs | Prometheus + Grafana + Loki |
| 5-20 engineers | 10-50 services | Metrics + logs + basic traces | Add Jaeger, OpenTelemetry |
| 20+ engineers | 50+ services | Full observability + SLOs | Managed platform (Datadog, Grafana Cloud) |
First step: Implement metrics with OpenTelemetry + Prometheus
Why this order: metrics give the fastest time-to-value (detecting issues), logs help you debug (understanding what happened), and traces solve complex distributed problems (debugging cross-service issues)
Three Pillars Quick Reference
Metrics (Quantitative, aggregated)
When to use: Alerting, dashboards, trend analysis
What to collect:
- RED method (services): Rate, Errors, Duration
- USE method (resources): Utilization, Saturation, Errors
- Four Golden Signals: Latency, traffic, errors, saturation
Implementation:
```python
# OpenTelemetry metrics
from opentelemetry import metrics

meter = metrics.get_meter(__name__)

request_counter = meter.create_counter(
    "http_requests_total",
    description="Total HTTP requests",
)
request_duration = meter.create_histogram(
    "http_request_duration_seconds",
    unit="s",
    description="HTTP request duration",
)

# Instrument a request (duration is measured by the caller, in seconds)
request_counter.add(1, {"method": "GET", "endpoint": "/api/users"})
request_duration.record(duration, {"method": "GET", "endpoint": "/api/users"})
```
Logs (Discrete events)
When to use: Debugging, audit trails, error investigation
Best practices:
- Structured logging (JSON)
- Include correlation IDs
- Don't log sensitive data (PII, secrets)
Implementation:
```python
import structlog

log = structlog.get_logger()

# Field values (user_id, correlation_id, ...) come from the request context
log.info(
    "user_login",
    user_id=user_id,
    correlation_id=correlation_id,
    ip_address=ip,
    duration_ms=duration,
)
```
Traces (Request flows)
When to use: Debugging distributed systems, latency investigation
Implementation:
```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order.id", order_id)
    span.set_attribute("user.id", user_id)
    # Process order logic
```
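Traces only pay off across service boundaries if context is propagated on outbound calls. A minimal sketch using OpenTelemetry's propagation API; the downstream URL and the use of `requests` are illustrative:

```python
import requests
from opentelemetry.propagate import inject

def call_inventory(order_id):
    headers = {}
    inject(headers)  # writes the W3C traceparent header from the current span
    # inventory.internal is a placeholder for your downstream service
    return requests.get(f"http://inventory.internal/orders/{order_id}", headers=headers)
```

Auto-instrumentation for common HTTP clients does this for you; the manual version is shown to make the mechanism visible.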
Anti-Patterns Catalog
❌ Vanity Metrics
Symptom: Tracking metrics that look impressive but don't inform decisions
Why bad: Wastes resources, distracts from actionable metrics
Fix: Only collect metrics that answer "should I page someone?" or inform business decisions
```python
# ❌ Bad - vanity metric; never informs a decision
total_requests_all_time_counter.inc()

# ✅ Good - actionable: count errors per service/endpoint,
# compute the error *rate* at query time
request_errors_total.labels(service="api", endpoint="/users").inc()
```
❌ Alert on Everything
Symptom: Hundreds of alerts per day, team ignores most of them
Why bad: Alert fatigue, real issues get missed, on-call burnout
Fix: Alert only on user-impacting symptoms that require immediate action
Test: "If this alert fires at 2am, should someone wake up to fix it?" If no, it's not an alert.
❌ No Runbooks
Symptom: Alerts fire with no guidance on how to respond
Why bad: Increased MTTR, inconsistent responses, on-call stress
Fix: Every alert must link to a runbook with investigation steps
```yaml
# ✅ Good - alert with a linked runbook
alert: HighErrorRate
annotations:
  summary: "Error rate >5% on {{ $labels.service }}"
  description: "Current: {{ $value }}%"
  runbook: "https://wiki.company.com/runbooks/high-error-rate"
```
❌ Cardinality Explosion
Symptom: Metrics with unbounded labels (user IDs, timestamps, UUIDs) cause storage/performance issues
Why bad: Expensive storage, slow queries, potential system failure
Fix: Use fixed cardinality labels, aggregate high-cardinality dimensions
```python
# ❌ Bad - unbounded cardinality
request_counter.labels(user_id=user_id).inc()  # Millions of unique series

# ✅ Good - bounded cardinality
request_counter.labels(user_type="premium", region="us-east").inc()
```
❌ Missing Correlation IDs
Symptom: Can't trace requests across services, debugging takes hours
Why bad: High MTTR, frustrated engineers, customer impact
Fix: Generate correlation ID at entry point, propagate through all services
```python
# ✅ Good - correlation ID propagation
# (request and log come from your web framework and logger setup)
import uuid
from contextvars import ContextVar

correlation_id_var = ContextVar("correlation_id", default=None)

def handle_request():
    # Reuse the caller's ID if present; otherwise mint a new one
    correlation_id = request.headers.get("X-Correlation-ID") or str(uuid.uuid4())
    correlation_id_var.set(correlation_id)
    # All logs and traces include it from here on
    log.info("processing_request", correlation_id=correlation_id)
```
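If you log with structlog, its contextvars helpers can attach the ID to every subsequent log line automatically instead of passing it by hand. A sketch building on the handler above:

```python
import structlog

structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,  # folds bound context into each event
        structlog.processors.JSONRenderer(),
    ]
)

# Inside handle_request(), after resolving correlation_id:
structlog.contextvars.bind_contextvars(correlation_id=correlation_id)
log.info("processing_request")  # correlation_id is included automatically
```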
SLI Selection Framework
Principle: Measure user experience, not system internals
Four Golden Signals
| Signal | Definition | Example SLI |
|---|---|---|
| Latency | Request response time | p99 latency < 200ms |
| Traffic | Demand on system | Requests per second |
| Errors | Failed requests | Error rate < 0.1% |
| Saturation | Resource fullness | CPU < 80%, queue depth < 100 |
RED Method (for services)
- Rate: Requests per second
- Errors: Error rate (%)
- Duration: Response time (p50, p95, p99)
USE Method (for resources)
- Utilization: % time resource busy (CPU %, disk I/O %)
- Saturation: Queue depth, wait time
- Errors: Error count
Decision framework:
| Service Type | Recommended SLIs |
|---|---|
| User-facing API | Availability (%), p95 latency, error rate |
| Background jobs | Freshness (time since last run), success rate, processing time |
| Data pipeline | Data freshness, completeness (%), processing latency |
| Storage | Availability, durability, latency percentiles |
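For background jobs (second row above), freshness is usually the easiest SLI to start with. One common pattern, sketched here with an illustrative metric name and a placeholder `do_work()`, is a "last success" gauge:

```python
import time
from prometheus_client import Gauge

last_success = Gauge(
    "job_last_success_timestamp_seconds",
    "Unix time of the last successful job run",
)

def run_job():
    do_work()  # placeholder for the actual job logic
    last_success.set(time.time())

# Freshness SLI: alert when time() - job_last_success_timestamp_seconds
# exceeds your freshness threshold
```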
SLO Definition Guide
SLO = Target value for SLI
Formula: SLO = (good events / total events) >= target
Example:
SLI: Request success rate
SLO: 99.9% of requests succeed (measured over 30 days)
Error budget: 0.1% = ~43 minutes downtime/month
Error Budget
Definition: Amount of unreliability you can tolerate
Calculation:
Error budget = 1 - SLO target
If SLO = 99.9%, error budget = 0.1%
For 1M requests/month: 1,000 requests can fail
Usage: Balance reliability vs feature velocity
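The arithmetic behind those numbers, as a quick sketch:

```python
slo_target = 0.999
window_minutes = 30 * 24 * 60            # 43,200 minutes in a 30-day window

error_budget = 1 - slo_target            # 0.001 (0.1%)
downtime_budget = window_minutes * error_budget   # ~43.2 minutes/month

requests_per_month = 1_000_000
allowed_failures = requests_per_month * error_budget  # 1,000 failed requests
```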
Multi-Window Multi-Burn-Rate Alerting
Problem: Simple threshold alerts are either too noisy or too slow
Solution: Alert based on how fast you're burning error budget
```yaml
# Page if the budget is burning 14.4x faster than sustainable
# (i.e. 2% of a 30-day budget consumed in 1 hour), checked on both
# a long and a short window so the alert also clears quickly
alert: ErrorBudgetBurnRateHigh
expr: |
  (sum(rate(errors[1h])) / sum(rate(requests[1h]))) > (14.4 * (1 - 0.999))
  and
  (sum(rate(errors[5m])) / sum(rate(requests[5m]))) > (14.4 * (1 - 0.999))
annotations:
  summary: "Burning error budget at 14.4x rate"
  runbook: "https://wiki/runbooks/error-budget-burn"
```
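Where 14.4 comes from: a burn rate is budget consumption scaled from the alert window to the SLO window. For the common page-worthy case of 2% of a 30-day budget burned in one hour:

```python
slo_window_hours = 30 * 24      # 720-hour (30-day) SLO window
budget_fraction = 0.02          # page when 2% of the budget burns...
alert_window_hours = 1          # ...within one hour

burn_rate = budget_fraction * slo_window_hours / alert_window_hours  # 14.4
```

The same formula yields slower-burn warning tiers (e.g. burn rate 6 for 5% in 6 hours).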
Alert Design Patterns
Principle: Alert on symptoms (user impact) not causes (CPU high)
Symptom-Based Alerting
```yaml
# ❌ Bad - alert on a cause
alert: HighCPU
expr: cpu_usage > 0.80

# ✅ Good - alert on a symptom
alert: HighLatency
expr: http_request_duration_p99 > 1.0  # seconds
```
Alert Severity Levels
| Level | When | Response Time | Example |
|---|---|---|---|
| Critical | User-impacting | Immediate (page) | Error rate >5%, service down |
| Warning | Will become critical | Next business day | Error rate >1%, disk 85% full |
| Info | Informational | No action needed | Deploy completed, scaling event |
Rule: Only page for critical. Everything else goes to dashboard/Slack.
Cost Optimization Quick Reference
Observability can cost 5-15% of infrastructure spend. Optimize:
Sampling Strategies
```python
# Trace sampling - keep 10% of traces
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

sampler = TraceIdRatioBased(0.1)  # sample 10% of trace IDs
provider = TracerProvider(sampler=sampler)
```
When to sample:
- Traces: 1-10% for high-traffic services
- Logs: Sample debug/info logs, keep all errors (see the sketch after this list)
- Metrics: Don't sample (they're already aggregated)
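Log sampling can be done in-process. A hypothetical structlog processor that keeps all warnings and errors but passes only ~10% of debug/info events (the rate and the approach are one option among several):

```python
import random
import structlog

def sample_low_severity(logger, method_name, event_dict):
    # Keep every warning/error; keep roughly 10% of debug/info
    if method_name in ("debug", "info") and random.random() >= 0.1:
        raise structlog.DropEvent
    return event_dict

structlog.configure(
    processors=[sample_low_severity, structlog.processors.JSONRenderer()]
)
```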
Retention Policies
| Data Type | Recommended Retention | Rationale |
|---|---|---|
| Metrics | 15 days (raw), 13 months (aggregated) | Trend analysis |
| Logs | 7-30 days | Debugging, compliance |
| Traces | 7 days | Debugging recent issues |
Cardinality Control
```python
# ❌ Bad - high cardinality
http_requests.labels(
    method=method,
    url=full_url,      # Unbounded!
    user_id=user_id,   # Unbounded!
).inc()

# ✅ Good - controlled cardinality
http_requests.labels(
    method=method,
    endpoint=route_pattern,  # /users/:id, not /users/12345
    status_code=status,
).inc()
```
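If your framework exposes the matched route template (Flask, FastAPI, and most routers do), label with that directly. Otherwise, a normalization helper like this hypothetical one keeps label values bounded:

```python
import re

def route_pattern(path: str) -> str:
    # Fallback normalization; prefer the framework's route template
    path = re.sub(r"/\d+", "/:id", path)  # numeric IDs
    path = re.sub(
        r"/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
        "/:uuid",
        path,
    )
    return path

route_pattern("/users/12345")  # -> "/users/:id"
```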
Tool Ecosystem Quick Reference
| Category | Open Source | Managed/Commercial |
|---|---|---|
| Metrics | Prometheus, VictoriaMetrics | Datadog, New Relic, Grafana Cloud |
| Logs | Loki, ELK Stack | Datadog, Splunk, Sumo Logic |
| Traces | Jaeger, Zipkin | Datadog, Honeycomb, Lightstep |
| All-in-One | Grafana + Loki + Tempo + Mimir | Datadog, New Relic, Dynatrace |
| Instrumentation | OpenTelemetry | (vendor SDKs) |
Recommendation:
- Starting out: Prometheus + Grafana + OpenTelemetry
- Growing (10-50 services): Add Loki (logs) + Jaeger (traces)
- Scale (50+ services): Consider managed (Datadog, Grafana Cloud)
Why OpenTelemetry: Vendor-neutral, future-proof, single instrumentation for all signals
Your First Observability Setup
Goal: Metrics + alerting in one week
Day 1-2: Instrument application
```python
# Add OpenTelemetry metrics with a Prometheus exporter
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.prometheus import PrometheusMetricReader
from prometheus_client import start_http_server

# Initialize and expose /metrics for Prometheus to scrape
meter_provider = MeterProvider(metric_readers=[PrometheusMetricReader()])
metrics.set_meter_provider(meter_provider)
start_http_server(8000)  # port is arbitrary; match your scrape config

# Instrument the HTTP framework (auto-instrumentation);
# `app` is your Flask application
from opentelemetry.instrumentation.flask import FlaskInstrumentor
FlaskInstrumentor().instrument_app(app)
```
Day 3-4: Deploy Prometheus + Grafana
```yaml
# docker-compose.yml
version: '3'
services:
  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
```
Day 5: Define SLIs and SLOs
SLI: HTTP request success rate
SLO: 99.9% of requests succeed (30-day window)
Error budget: 0.1% = 43 minutes downtime/month
Day 6: Create alerts
```yaml
# prometheus-alerts.yml
groups:
  - name: slo_alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            / sum(rate(http_requests_total[5m])) by (service) > 0.05
        for: 5m
        annotations:
          summary: "Error rate >5% on {{ $labels.service }}"
          runbook: "https://wiki/runbooks/high-error-rate"
```
Day 7: Build dashboard
Panels to include (example queries after this list):
- Error rate (%)
- Request rate (req/s)
- p50/p95/p99 latency
- CPU/memory utilization
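Example PromQL for those panels, kept as Python strings so they can be templated into dashboard JSON; the metric names assume the Day 1-2 instrumentation:

```python
PANEL_QUERIES = {
    "error_rate": (
        'sum(rate(http_requests_total{status=~"5.."}[5m]))'
        " / sum(rate(http_requests_total[5m]))"
    ),
    "request_rate": "sum(rate(http_requests_total[5m]))",
    "p99_latency": (
        "histogram_quantile(0.99, "
        "sum(rate(http_request_duration_seconds_bucket[5m])) by (le))"
    ),
}
```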
Common Mistakes
❌ Logging in Production == Debugging in Production
Fix: Use structured logging with correlation IDs, not print statements
❌ Alerting on Predictions, Not Reality
Fix: Alert on actual user impact (errors, latency) not predicted issues (disk 70% full)
❌ Dashboard Sprawl
Fix: One main dashboard per service showing SLIs. Delete dashboards unused for 3 months.
❌ Ignoring Alert Feedback Loop
Fix: Track alert precision (% that led to action). Delete alerts with <50% precision.
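A worked example of that threshold, with invented numbers:

```python
alerts_fired = 40        # pages over the review period
led_to_action = 12       # fires where on-call actually changed something

precision = led_to_action / alerts_fired  # 0.30 -> below 0.5: retune or delete
```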
Quick Reference
Getting Started:
- Start with metrics (Prometheus + OpenTelemetry)
- Add logs when debugging is hard (Loki)
- Add traces when issues span services (Jaeger)
SLI Selection:
- User-facing: Availability, latency, error rate
- Background: Freshness, success rate, processing time
SLO Targets:
- Start with 99% (achievable)
- Increase to 99.9% only if business requires it
- 99.99% is very expensive (4 nines = 52 min/year downtime)
Alerting:
- Critical only = page
- Warning = next business day
- Info = dashboard only
Cost Control:
- Sample traces (1-10%)
- Control metric cardinality (no unbounded labels)
- Set retention policies (7-30 days logs, 15 days metrics)
Tools:
- Small: Prometheus + Grafana + Loki
- Medium: Add Jaeger
- Large: Consider Datadog, Grafana Cloud
Bottom Line
Start with metrics using OpenTelemetry + Prometheus. Define 3-5 SLIs based on user experience. Alert only on symptoms that require immediate action. Add logs and traces when metrics aren't enough.
Measure what users care about, not what's easy to measure.