SKILL.md

name: observability-and-monitoring
description: Use when implementing metrics/logs/traces, defining SLIs/SLOs, designing alerts, choosing observability tools, debugging alert fatigue, or optimizing observability costs - provides SRE frameworks, anti-patterns, and implementation patterns

Observability and Monitoring

Overview

Core principle: Measure what users care about, alert on symptoms not causes, make alerts actionable.

Rule: Observability without actionability is just expensive logging.

Already have observability tools (CloudWatch, Datadog, etc.)? Optimize what you have first. Most observability problems are usage/process issues, not tooling. Implement SLIs/SLOs, clean up alerts, add runbooks with existing tools. Migrate only if you hit concrete tool limitations (cost, features, multi-cloud). Tool migration is expensive - make sure it solves a real problem.

Getting Started Decision Tree

| Team Size | Scale | Starting Point | Tools |
|---|---|---|---|
| 1-5 engineers | <10 services | Metrics + logs | Prometheus + Grafana + Loki |
| 5-20 engineers | 10-50 services | Metrics + logs + basic traces | Add Jaeger, OpenTelemetry |
| 20+ engineers | 50+ services | Full observability + SLOs | Managed platform (Datadog, Grafana Cloud) |

First step: Implement metrics with OpenTelemetry + Prometheus

Why this order: metrics give the fastest time to value (detecting issues), logs help you understand what happened (debugging), and traces solve cross-service problems (debugging distributed systems)

Three Pillars Quick Reference

Metrics (Quantitative, aggregated)

When to use: Alerting, dashboards, trend analysis

What to collect:

  • RED method (services): Rate, Errors, Duration
  • USE method (resources): Utilization, Saturation, Errors
  • Four Golden Signals: Latency, traffic, errors, saturation

Implementation:

# OpenTelemetry metrics
from opentelemetry import metrics

meter = metrics.get_meter(__name__)
request_counter = meter.create_counter(
    "http_requests_total",
    description="Total HTTP requests"
)
request_duration = meter.create_histogram(
    "http_request_duration_seconds",
    description="HTTP request duration"
)

# Instrument request
request_counter.add(1, {"method": "GET", "endpoint": "/api/users"})
request_duration.record(duration, {"method": "GET", "endpoint": "/api/users"})

Logs (Discrete events)

When to use: Debugging, audit trails, error investigation

Best practices:

  • Structured logging (JSON)
  • Include correlation IDs
  • Don't log sensitive data (PII, secrets)

Implementation:

import structlog

log = structlog.get_logger()
log.info(
    "user_login",
    user_id=user_id,
    correlation_id=correlation_id,
    ip_address=ip,
    duration_ms=duration
)

Traces (Request flows)

When to use: Debugging distributed systems, latency investigation

Implementation:

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order.id", order_id)
    span.set_attribute("user.id", user_id)
    # Process order logic

Anti-Patterns Catalog

❌ Vanity Metrics

Symptom: Tracking metrics that look impressive but don't inform decisions

Why bad: Wastes resources, distracts from actionable metrics

Fix: Only collect metrics that answer "should I page someone?" or inform business decisions

# ❌ Bad - vanity metric
total_requests_all_time_counter.inc()

# ✅ Good - actionable metric
request_error_rate.labels(service="api", endpoint="/users").observe(error_rate)

❌ Alert on Everything

Symptom: Hundreds of alerts per day, team ignores most of them

Why bad: Alert fatigue, real issues get missed, on-call burnout

Fix: Alert only on user-impacting symptoms that require immediate action

Test: "If this alert fires at 2am, should someone wake up to fix it?" If no, it's not an alert.


❌ No Runbooks

Symptom: Alerts fire with no guidance on how to respond

Why bad: Increased MTTR, inconsistent responses, on-call stress

Fix: Every alert must link to a runbook with investigation steps

# ✅ Good alert with runbook
alert: HighErrorRate
annotations:
  summary: "Error rate >5% on {{$labels.service}}"
  description: "Current: {{$value}}%"
  runbook: "https://wiki.company.com/runbooks/high-error-rate"

❌ Cardinality Explosion

Symptom: Metrics with unbounded labels (user IDs, timestamps, UUIDs) cause storage/performance issues

Why bad: Expensive storage, slow queries, potential system failure

Fix: Use fixed cardinality labels, aggregate high-cardinality dimensions

# ❌ Bad - unbounded cardinality
request_counter.labels(user_id=user_id).inc()  # Millions of unique series

# ✅ Good - bounded cardinality
request_counter.labels(user_type="premium", region="us-east").inc()

❌ Missing Correlation IDs

Symptom: Can't trace requests across services, debugging takes hours

Why bad: High MTTR, frustrated engineers, customer impact

Fix: Generate correlation ID at entry point, propagate through all services

# ✅ Good - correlation ID propagation
import uuid
from contextvars import ContextVar

correlation_id_var = ContextVar("correlation_id", default=None)

def handle_request():
    correlation_id = request.headers.get("X-Correlation-ID") or str(uuid.uuid4())
    correlation_id_var.set(correlation_id)

    # All logs and traces include it automatically
    log.info("processing_request", extra={"correlation_id": correlation_id})

SLI Selection Framework

Principle: Measure user experience, not system internals

Four Golden Signals

| Signal | Definition | Example SLI |
|---|---|---|
| Latency | Request response time | p99 latency < 200ms |
| Traffic | Demand on system | Requests per second |
| Errors | Failed requests | Error rate < 0.1% |
| Saturation | Resource fullness | CPU < 80%, queue depth < 100 |

RED Method (for services)

  • Rate: Requests per second
  • Errors: Error rate (%)
  • Duration: Response time (p50, p95, p99)

USE Method (for resources)

  • Utilization: % time resource busy (CPU %, disk I/O %)
  • Saturation: Queue depth, wait time
  • Errors: Error count

Decision framework:

| Service Type | Recommended SLIs |
|---|---|
| User-facing API | Availability (%), p95 latency, error rate |
| Background jobs | Freshness (time since last run), success rate, processing time |
| Data pipeline | Data freshness, completeness (%), processing latency |
| Storage | Availability, durability, latency percentiles |
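
A hedged sketch of turning one row of that table into code: record a user-facing SLI as a good/total counter pair (the ratio is the SLI). The 5xx and 200ms thresholds are illustrative choices, not part of the framework:

from opentelemetry import metrics

meter = metrics.get_meter(__name__)
# Two counters whose ratio is the SLI: good events / total events
sli_total = meter.create_counter("sli_http_requests_total", description="All requests counted toward the SLI")
sli_good = meter.create_counter("sli_http_requests_good_total", description="Requests meeting the SLI criteria")

def record_request_sli(status_code: int, duration_s: float, endpoint: str):
    attrs = {"endpoint": endpoint}
    sli_total.add(1, attrs)
    # "Good" = available (no 5xx) and fast enough (illustrative 200ms threshold)
    if status_code < 500 and duration_s < 0.2:
        sli_good.add(1, attrs)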

SLO Definition Guide

SLO = Target value for SLI

Formula: SLO = (good events / total events) >= target

Example:

SLI: Request success rate
SLO: 99.9% of requests succeed (measured over 30 days)
Error budget: 0.1% = ~43 minutes downtime/month

Error Budget

Definition: Amount of unreliability you can tolerate

Calculation:

Error budget = 1 - SLO target
If SLO = 99.9%, error budget = 0.1%
For 1M requests/month: 1,000 requests can fail

Usage: Balance reliability vs feature velocity
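
A quick sketch of that arithmetic, assuming a 30-day window and the 1M-requests example above:

# Error budget math for a 30-day SLO window
slo_target = 0.999
window_minutes = 30 * 24 * 60               # 43,200 minutes
monthly_requests = 1_000_000

error_budget = 1 - slo_target               # 0.001 -> 0.1%
budget_requests = monthly_requests * error_budget   # 1,000 requests may fail
budget_minutes = window_minutes * error_budget      # ~43 minutes of full downtime

print(f"{budget_requests:.0f} failed requests or {budget_minutes:.1f} minutes of downtime per 30 days")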

Multi-Window Multi-Burn-Rate Alerting

Problem: Simple threshold alerts are either too noisy or too slow

Solution: Alert based on how fast you're burning error budget

# Alert when burning error budget at 14.4x the sustainable rate (= 2% of a 30-day budget consumed in 1 hour)
alert: ErrorBudgetBurnRateHigh
expr: |
  (
    rate(errors[1h]) / rate(requests[1h])
  ) > (14.4 * (1 - 0.999))
annotations:
  summary: "Burning error budget at 14.4x rate"
  runbook: "https://wiki/runbooks/error-budget-burn"

Alert Design Patterns

Principle: Alert on symptoms (user impact), not causes (high CPU)

Symptom-Based Alerting

# ❌ Bad - alert on cause
alert: HighCPU
expr: cpu_usage > 80%

# ✅ Good - alert on symptom
alert: HighLatency
expr: http_request_duration_p99 > 1.0

Alert Severity Levels

| Level | When | Response Time | Example |
|---|---|---|---|
| Critical | User-impacting | Immediate (page) | Error rate >5%, service down |
| Warning | Will become critical | Next business day | Error rate >1%, disk 85% full |
| Info | Informational | No action needed | Deploy completed, scaling event |

Rule: Only page for critical. Everything else goes to dashboard/Slack.

Cost Optimization Quick Reference

Observability can cost 5-15% of infrastructure spend. Optimize:

Sampling Strategies

# Trace sampling - collect 10% of traces
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

sampler = TraceIdRatioBased(0.1)  # 10% sampling

When to sample:

  • Traces: 1-10% for high-traffic services
  • Logs: Sample debug/info logs, keep all errors (see the sketch after this list)
  • Metrics: Don't sample (they're already aggregated)
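
A minimal sketch of the log-sampling bullet, using the standard library's logging filters (the 10% rate and handler wiring are illustrative; a structlog processor that drops events could do the same):

import logging
import random

class InfoSampler(logging.Filter):
    """Keep every WARNING-and-above record; sample DEBUG/INFO at the given rate."""
    def __init__(self, sample_rate: float = 0.1):
        super().__init__()
        self.sample_rate = sample_rate

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.WARNING:
            return True  # never drop warnings or errors
        return random.random() < self.sample_rate

handler = logging.StreamHandler()
handler.addFilter(InfoSampler(0.1))  # keep ~10% of debug/info
logging.basicConfig(level=logging.DEBUG, handlers=[handler])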

Retention Policies

| Data Type | Recommended Retention | Rationale |
|---|---|---|
| Metrics | 15 days (raw), 13 months (aggregated) | Trend analysis |
| Logs | 7-30 days | Debugging, compliance |
| Traces | 7 days | Debugging recent issues |

Cardinality Control

# ❌ Bad - high cardinality
http_requests.labels(
    method=method,
    url=full_url,  # Unbounded!
    user_id=user_id  # Unbounded!
)

# ✅ Good - controlled cardinality
http_requests.labels(
    method=method,
    endpoint=route_pattern,  # /users/:id not /users/12345
    status_code=status
)

Tool Ecosystem Quick Reference

| Category | Open Source | Managed/Commercial |
|---|---|---|
| Metrics | Prometheus, VictoriaMetrics | Datadog, New Relic, Grafana Cloud |
| Logs | Loki, ELK Stack | Datadog, Splunk, Sumo Logic |
| Traces | Jaeger, Zipkin | Datadog, Honeycomb, Lightstep |
| All-in-One | Grafana + Loki + Tempo + Mimir | Datadog, New Relic, Dynatrace |
| Instrumentation | OpenTelemetry | Vendor SDKs |

Recommendation:

  • Starting out: Prometheus + Grafana + OpenTelemetry
  • Growing (10-50 services): Add Loki (logs) + Jaeger (traces)
  • Scale (50+ services): Consider managed (Datadog, Grafana Cloud)

Why OpenTelemetry: Vendor-neutral, future-proof, single instrumentation for all signals

Your First Observability Setup

Goal: Metrics + alerting in one week

Day 1-2: Instrument application

# Add OpenTelemetry
from opentelemetry import metrics, trace
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.exporter.prometheus import PrometheusMetricReader

# Initialize
meter_provider = MeterProvider(
    metric_readers=[PrometheusMetricReader()]
)
metrics.set_meter_provider(meter_provider)

# Instrument HTTP framework (auto-instrumentation)
from opentelemetry.instrumentation.flask import FlaskInstrumentor
FlaskInstrumentor().instrument_app(app)
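
# Expose the scrape endpoint so Prometheus can pull these metrics
# (assumes PrometheusMetricReader registers with prometheus_client's default registry,
#  which the opentelemetry-exporter-prometheus package does by default)
from prometheus_client import start_http_server
start_http_server(8000)  # metrics served at http://localhost:8000/metrics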

Day 3-4: Deploy Prometheus + Grafana

# docker-compose.yml
version: '3'
services:
  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
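      # prometheus.yml must define a scrape_configs job pointing at the app's
      # metrics endpoint (port 8000 in the instrumentation sketch above)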

  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"

Day 5: Define SLIs and SLOs

SLI: HTTP request success rate
SLO: 99.9% of requests succeed (30-day window)
Error budget: 0.1% = 43 minutes downtime/month

Day 6: Create alerts

# prometheus-alerts.yml
groups:
  - name: slo_alerts
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate >5% on {{$labels.service}}"
          runbook: "https://wiki/runbooks/high-error-rate"

Day 7: Build dashboard

Panels to include (starter queries sketched after the list):

  • Error rate (%)
  • Request rate (req/s)
  • p50/p95/p99 latency
  • CPU/memory utilization
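
Hedged starter queries for the first three panels, assuming the metric names used earlier in this skill (an http_requests_total counter and an http_request_duration_seconds histogram); CPU/memory queries depend on your exporter, so they're omitted:

# PromQL kept as strings, e.g. for Grafana dashboard provisioning
PANEL_QUERIES = {
    "error_rate": 'sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))',
    "request_rate": "sum(rate(http_requests_total[5m]))",
    "latency_p99": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))",
}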

Common Mistakes

❌ Logging in Production == Debugging in Production

Fix: Use structured logging with correlation IDs, not print statements


❌ Alerting on Predictions, Not Reality

Fix: Alert on actual user impact (errors, latency) not predicted issues (disk 70% full)


❌ Dashboard Sprawl

Fix: One main dashboard per service showing SLIs. Delete dashboards unused for 3 months.


❌ Ignoring Alert Feedback Loop

Fix: Track alert precision (% that led to action). Delete alerts with <50% precision.
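
A back-of-the-envelope sketch of that precision check; the alert-outcome records are illustrative (pull them from wherever your paging history lives):

from collections import Counter

# (alert name, did it lead to action?) - illustrative paging history
history = [
    ("HighErrorRate", True), ("HighErrorRate", True),
    ("HighCPU", False), ("HighCPU", False), ("HighCPU", True),
]

fired, actioned = Counter(), Counter()
for name, acted_on in history:
    fired[name] += 1
    actioned[name] += acted_on

for name in fired:
    precision = actioned[name] / fired[name]
    verdict = "keep" if precision >= 0.5 else "delete or demote"
    print(f"{name}: {precision:.0%} precision -> {verdict}")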

Quick Reference

Getting Started:

  • Start with metrics (Prometheus + OpenTelemetry)
  • Add logs when debugging is hard (Loki)
  • Add traces when issues span services (Jaeger)

SLI Selection:

  • User-facing: Availability, latency, error rate
  • Background: Freshness, success rate, processing time

SLO Targets:

  • Start with 99% (achievable)
  • Increase to 99.9% only if business requires it
  • 99.99% is very expensive (4 nines = 52 min/year downtime)

Alerting:

  • Critical only = page
  • Warning = next business day
  • Info = dashboard only

Cost Control:

  • Sample traces (1-10%)
  • Control metric cardinality (no unbounded labels)
  • Set retention policies (7-30 days logs, 15 days metrics)

Tools:

  • Small: Prometheus + Grafana + Loki
  • Medium: Add Jaeger
  • Large: Consider Datadog, Grafana Cloud

Bottom Line

Start with metrics using OpenTelemetry + Prometheus. Define 3-5 SLIs based on user experience. Alert only on symptoms that require immediate action. Add logs and traces when metrics aren't enough.

Measure what users care about, not what's easy to measure.