Claude Code Plugins

Community-maintained marketplace

Observability and SRE expert. Use when setting up monitoring, logging, tracing, defining SLOs, or managing incidents. Covers Prometheus, Grafana, OpenTelemetry, and incident response best practices.

Install Skill

1. Download skill

2. Enable skills in Claude

   Open claude.ai/settings/capabilities and find the "Skills" section

3. Upload to Claude

   Click "Upload skill" and select the downloaded ZIP file

Note: Please verify the skill by reading through its instructions before using it.

SKILL.md

name: observability-sre
description: Observability and SRE expert. Use when setting up monitoring, logging, tracing, defining SLOs, or managing incidents. Covers Prometheus, Grafana, OpenTelemetry, and incident response best practices.

Observability & Site Reliability Engineering

Core Principles

  • Three Pillars — Metrics, Logs, and Traces provide holistic visibility
  • Observability-First — Build systems that explain their own behavior
  • SLO-Driven — Define reliability targets that matter to users
  • Proactive Detection — Find issues before customers do
  • Blameless Culture — Learn from failures without blame
  • Automate Toil — Reduce repetitive operational work
  • Continuous Improvement — Each incident makes systems more resilient
  • Full-Stack Visibility — Monitor from infrastructure to business metrics

Hard Rules (Must Follow)

These rules are mandatory. Violating them means the skill is not working correctly.

Symptom-Based Alerts Only

Alert on user-facing symptoms, not internal infrastructure metrics.

# ❌ FORBIDDEN: Alerting on internal metrics
- alert: CPUHigh
  expr: cpu_usage > 70%
  # Users don't care about CPU, they care about latency

- alert: MemoryHigh
  expr: memory_usage > 80%
  # Internal metric, may not affect users

# ✅ REQUIRED: Alert on user experience
- alert: APILatencyHigh
  expr: slo:api_latency:p95 > 0.200
  annotations:
    summary: "Users experiencing slow response times"

- alert: ErrorRateHigh
  expr: slo:api_errors:rate5m > 0.001
  annotations:
    summary: "Users encountering errors"

Low Cardinality Labels

Loki/Prometheus labels must have low cardinality: keep to fewer than ~10 label names, each with a small, bounded set of values.

# ❌ FORBIDDEN: High cardinality labels
labels:
  user_id: "usr_123"      # Millions of values!
  order_id: "ord_456"     # Millions of values!
  request_id: "req_789"   # Every request is unique!

# ✅ REQUIRED: Low cardinality only
labels:
  namespace: "production"  # Few values
  app: "api-server"        # Few values
  level: "error"           # 5-6 values
  method: "GET"            # ~10 values

# High cardinality data goes in log body:
logger.info({
  user_id: "usr_123",      # In JSON body, not label
  order_id: "ord_456",
}, "Order processed");

SLO-Based Error Budgets

Every service must have defined SLOs with error budget tracking.

# ❌ FORBIDDEN: No SLO definition
# Just monitoring without targets

# ✅ REQUIRED: Explicit SLO with budget
# SLO: 99.9% availability
# Error Budget: 0.1% = 43.2 minutes/month downtime

groups:
  - name: slo_tracking
    rules:
      - record: slo:api_availability:ratio
        expr: sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

      - alert: ErrorBudgetBurnRate
        expr: slo:api_availability:ratio < 0.999
        for: 5m
        annotations:
          summary: "Burning error budget too fast"

Trace Context in Logs

All logs must include trace_id for correlation with distributed traces.

// ❌ FORBIDDEN: Logs without trace context
logger.info("Payment processed");

// ✅ REQUIRED: Include trace_id in every log
const span = trace.getActiveSpan();
logger.info({
  trace_id: span?.spanContext().traceId,
  span_id: span?.spanContext().spanId,
  order_id: "ord_123",
}, "Payment processed");

// Output includes correlation:
// {"trace_id":"abc123","span_id":"def456","order_id":"ord_123","msg":"Payment processed"}

Quick Reference

When to Use What

| Scenario | Tool/Pattern | Reason |
|---|---|---|
| Metrics collection | Prometheus + Grafana | Industry standard, powerful query language |
| Distributed tracing | OpenTelemetry + Tempo/Jaeger | Vendor-neutral, CNCF standard |
| Log aggregation (cost-sensitive) | Grafana Loki | Indexes only labels, 10x cheaper |
| Log aggregation (search-heavy) | ELK Stack | Full-text search, advanced analytics |
| Unified observability | Elastic/Datadog/Dynatrace | Single pane of glass for all telemetry |
| Incident management | PagerDuty/Opsgenie | Alert routing, on-call scheduling |
| Chaos engineering | Gremlin/Chaos Mesh | Controlled failure injection |
| AIOps/Anomaly detection | Dynatrace/Datadog | AI-driven root cause analysis |

The Three Pillars

| Pillar | What | When | Tools |
|---|---|---|---|
| Metrics | Numerical time-series data | Real-time monitoring, alerting | Prometheus, StatsD, CloudWatch |
| Logs | Event records with context | Debugging, audit trails | Loki, ELK, Splunk |
| Traces | Request journey across services | Performance analysis, dependencies | OpenTelemetry, Jaeger, Zipkin |

Fourth Pillar (Emerging): Continuous Profiling — Code-level performance data (CPU, memory usage at function level)
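
For teams adopting this fourth pillar, wiring a profiler in is usually a few lines at process start. A minimal sketch, assuming Grafana Pyroscope's Node.js agent (@pyroscope/nodejs) and a Pyroscope server at the address shown:

// Hypothetical setup: continuously ships CPU/heap profiles alongside metrics, logs, and traces
import Pyroscope from '@pyroscope/nodejs';

Pyroscope.init({
  serverAddress: 'http://pyroscope:4040', // assumed in-cluster Pyroscope endpoint
  appName: 'api-server',                  // name the profiles are grouped under
});

Pyroscope.start();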


Observability Architecture

Layered Prometheus Setup

# 2025 Best Practice: Federated architecture
# Prevents metric chaos while enabling drill-down

# Layer 1: Application Prometheus
# - Detailed business logic metrics
# - High cardinality acceptable
# - Short retention (7 days)

# Layer 2: Cluster Prometheus
# - Per-environment/cluster metrics
# - Medium retention (30 days)
# - Aggregates from application level

# Layer 3: Global Prometheus
# - Cross-cluster critical metrics
# - Long retention (1 year)
# - Federation from cluster level

# Global Prometheus config
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="kubernetes-nodes"}'
        - '{__name__=~"job:.*"}'  # Recording rules only
    static_configs:
      - targets:
        - 'cluster-prom-us-east.internal:9090'
        - 'cluster-prom-eu-west.internal:9090'

Recording Rules for Performance

# Precompute expensive queries
groups:
  - name: api_performance
    interval: 30s
    rules:
      # Request rate (requests per second)
      - record: job:api_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job, method, status)

      # Error rate
      - record: job:api_errors:rate5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
          /
          sum(rate(http_requests_total[5m])) by (job)

      # P95 latency
      - record: job:api_latency:p95
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le))

Resource Optimization

# Increase scrape interval for high-target deployments
scrape_interval: 30s  # Default is 15s; doubling it roughly halves scrape load

# Use relabeling to drop unnecessary metrics
metric_relabel_configs:
  - source_labels: [__name__]
    regex: 'go_.*|process_.*'  # Drop Go runtime metrics
    action: drop

# Limit sample retention (set via command-line flags, not prometheus.yml)
# --storage.tsdb.retention.time=15d   # Keep only 15 days locally
# --storage.tsdb.retention.size=50GB  # Or max 50GB

Distributed Tracing with OpenTelemetry

Auto-Instrumentation Setup

// Node.js auto-instrumentation
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      // Auto-instruments HTTP, Express, PostgreSQL, Redis, etc.
      '@opentelemetry/instrumentation-fs': { enabled: false }, // Too noisy
    }),
  ],
});

sdk.start();

Manual Instrumentation for Business Logic

import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('payment-service', '1.0.0');

async function processPayment(orderId: string, amount: number) {
  // Create custom span for business operation
  return tracer.startActiveSpan('processPayment', async (span) => {
    try {
      // Add business context
      span.setAttributes({
        'order.id': orderId,
        'payment.amount': amount,
        'payment.currency': 'USD',
      });

      // Child span for external API call
      const paymentResult = await tracer.startActiveSpan('stripe.charge', async (childSpan) => {
        const result = await stripe.charges.create({ amount, currency: 'usd' });
        childSpan.setAttribute('stripe.charge_id', result.id);
        childSpan.setStatus({ code: SpanStatusCode.OK });
        childSpan.end();
        return result;
      });

      span.setStatus({ code: SpanStatusCode.OK });
      return paymentResult;
    } catch (error) {
      span.recordException(error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      throw error;
    } finally {
      span.end();
    }
  });
}

Sampling Strategies

# OpenTelemetry Collector config
processors:
  # Probabilistic sampling: Keep 10% of traces
  probabilistic_sampler:
    sampling_percentage: 10

  # Tail sampling: Make decisions after seeing full trace
  tail_sampling:
    policies:
      # Always sample errors
      - name: error-traces
        type: status_code
        status_code: {status_codes: [ERROR]}

      # Always sample slow requests
      - name: slow-traces
        type: latency
        latency: {threshold_ms: 1000}

      # Sample 5% of normal traffic
      - name: normal-traces
        type: probabilistic
        probabilistic: {sampling_percentage: 5}

Context Propagation

// Ensure trace context flows across services
import { propagation, context } from '@opentelemetry/api';

// Outgoing HTTP request (automatic with auto-instrumentation)
fetch('https://api.example.com/data', {
  headers: {
    // W3C Trace Context headers injected automatically:
    // traceparent: 00-<trace-id>-<span-id>-01
    // tracestate: vendor=value
  },
});

// Manual propagation for non-HTTP (e.g., message queues)
const carrier = {};
propagation.inject(context.active(), carrier);
await publishMessage(queue, { data: payload, headers: carrier });
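
On the consuming side, the carrier has to be extracted again so that spans created by the worker join the producer's trace. A minimal sketch of that step (the message shape and handleMessage callback are placeholder names):

// Consumer side: restore the producer's trace context from the message headers
import { propagation, context, trace } from '@opentelemetry/api';

const tracer = trace.getTracer('worker-service');

async function handleMessage(message: { data: unknown; headers: Record<string, string> }) {
  // Rebuild the context from the carrier written by propagation.inject() above
  const extractedContext = propagation.extract(context.active(), message.headers);

  await context.with(extractedContext, () =>
    tracer.startActiveSpan('queue.process', async (span) => {
      try {
        // ... process message.data; spans here are children of the producer's span ...
      } finally {
        span.end();
      }
    }),
  );
}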

Structured Logging Best Practices

JSON Logging Format

// Use structured logging library
import pino from 'pino';
import { trace } from '@opentelemetry/api'; // needed for the trace-context mixin below

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    level: (label) => ({ level: label }),
  },
  timestamp: pino.stdTimeFunctions.isoTime,
  // Include trace context in logs
  mixin() {
    const span = trace.getActiveSpan();
    if (!span) return {};

    const { traceId, spanId } = span.spanContext();
    return {
      trace_id: traceId,
      span_id: spanId,
    };
  },
});

// Structured logging with context
logger.info(
  {
    user_id: '123',
    order_id: 'ord_456',
    amount: 99.99,
    payment_method: 'card',
  },
  'Payment processed successfully'
);

// Output:
// {"level":"info","time":"2025-01-15T10:30:00.000Z","trace_id":"abc123","span_id":"def456","user_id":"123","order_id":"ord_456","amount":99.99,"payment_method":"card","msg":"Payment processed successfully"}

Log Levels

// Follow standard severity levels
logger.trace({ details }, 'Low-level debugging');     // Very verbose
logger.debug({ state }, 'Debug information');          // Development
logger.info({ event }, 'Normal operation');            // Production default
logger.warn({ issue }, 'Warning condition');           // Potential issues
logger.error({ error, context }, 'Error occurred');    // Errors
logger.fatal({ critical }, 'Fatal error');             // Process crash

Grafana Loki Configuration

# Promtail config - ships logs to Loki
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: kubernetes
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Add pod labels as Loki labels (LOW cardinality only!)
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
    pipeline_stages:
      # Parse JSON logs
      - json:
          expressions:
            level: level
            trace_id: trace_id
      # Promote only LOW-cardinality fields to labels
      # (keep trace_id in the log body: it is unique per request)
      - labels:
          level:

Loki Best Practices

  • Low Cardinality Labels — Use only 5-10 labels (namespace, app, level)
  • High Cardinality in Log Body — Put user_id, order_id in JSON, not labels
  • LogQL for Filtering — Use {app="api"} | json | user_id="123"
  • Retention Policy — Keep recent logs longer, compress old logs

# LogQL query examples
{namespace="production", app="api"} |= "error"  # Text search

{app="api"} | json | level="error" | line_format "{{.msg}}"  # JSON parsing

rate({app="api"}[5m])  # Log rate per second

sum by (level) (count_over_time({namespace="production"}[1h]))  # Count by level

SLO/SLI/SLA Management

Definitions

  • SLI (Service Level Indicator) — Quantifiable measurement of service behavior

    • Examples: Request latency, error rate, availability, throughput
  • SLO (Service Level Objective) — Target value/range for an SLI

    • Examples: 99.9% availability, P95 latency < 200ms
  • SLA (Service Level Agreement) — Formal commitment with consequences

    • Examples: "99.9% uptime or 10% credit"

The Four Golden Signals

# Google SRE's key metrics for any service

1. Latency
   SLI: P95 request latency
   SLO: 95% of requests complete in < 200ms

2. Traffic
   SLI: Requests per second
   SLO: Handle 10,000 req/s peak load

3. Errors
   SLI: Error rate (5xx / total)
   SLO: < 0.1% error rate

4. Saturation
   SLI: Resource utilization (CPU, memory, disk)
   SLO: CPU < 70%, Memory < 80%

Error Budget

# Error budget = 1 - SLO
SLO = 99.9%  # "three nines"
Error_Budget = 100% - 99.9% = 0.1%

# Monthly calculation (30 days)
Total_Minutes = 30 * 24 * 60 = 43,200 minutes
Allowed_Downtime = 43,200 * 0.001 = 43.2 minutes

# If you've had 20 minutes downtime this month:
Budget_Remaining = 43.2 - 20 = 23.2 minutes
Budget_Consumed = 20 / 43.2 = 46.3%

# Policy: If budget > 90% consumed, freeze deployments
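
The same arithmetic as a small TypeScript helper, handy for burn-rate dashboards or a deployment-freeze check (a sketch only; the 30-day window and the 90% freeze policy above are the assumptions it encodes):

// Minimal error-budget math for a rolling 30-day window (mirrors the numbers above)
function errorBudget(sloPercent: number, windowDays = 30) {
  const totalMinutes = windowDays * 24 * 60;        // 43,200 for 30 days
  const budgetRatio = 1 - sloPercent / 100;         // 0.001 for a 99.9% SLO
  return { totalMinutes, allowedDowntimeMin: totalMinutes * budgetRatio };
}

// How much budget is left after the downtime observed so far?
function budgetStatus(sloPercent: number, downtimeMin: number) {
  const { allowedDowntimeMin } = errorBudget(sloPercent);
  return {
    remainingMin: allowedDowntimeMin - downtimeMin,
    consumedPct: (downtimeMin / allowedDowntimeMin) * 100,
  };
}

// SLO 99.9%, 20 minutes of downtime this month:
console.log(budgetStatus(99.9, 20)); // ~{ remainingMin: 23.2, consumedPct: 46.3 }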

SLO Implementation with Prometheus

# Recording rules for SLI calculation
groups:
  - name: slo_availability
    interval: 30s
    rules:
      # Total requests
      - record: slo:api_requests:total
        expr: sum(rate(http_requests_total[5m]))

      # Successful requests (non-5xx)
      - record: slo:api_requests:success
        expr: sum(rate(http_requests_total{status!~"5.."}[5m]))

      # Availability SLI
      - record: slo:api_availability:ratio
        expr: slo:api_requests:success / slo:api_requests:total

      # 30-day availability
      - record: slo:api_availability:30d
        expr: avg_over_time(slo:api_availability:ratio[30d])

  - name: slo_latency
    interval: 30s
    rules:
      # P95 latency SLI
      - record: slo:api_latency:p95
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Alerting on SLO burn rate
- alert: HighErrorBudgetBurnRate
  expr: |
    (
      slo:api_availability:ratio < 0.999  # Below 99.9% SLO
      and
      slo:api_availability:30d > 0.999    # But 30-day average still OK
    )
  for: 5m
  annotations:
    summary: "Burning error budget too fast"
    description: "Current availability {{ $value }} is below SLO. {{ $labels.service }}"

Incident Response

Incident Severity Levels

| Level | Impact | Response Time | Examples |
|---|---|---|---|
| SEV-1 | Service down or major degradation | < 15 min | Complete outage, data loss, security breach |
| SEV-2 | Significant impact, partial outage | < 1 hour | Feature unavailable, high error rates |
| SEV-3 | Minor impact, workaround exists | < 4 hours | Single component degraded, slow performance |
| SEV-4 | Cosmetic, no user impact | Next business day | UI glitches, logging errors |

Incident Response Roles (IMAG Framework)

Incident Commander (IC):
  - Overall coordination and decision-making
  - Declares incident start/end
  - Decides on escalations
  - Owns communication to leadership

Operations Lead (OL):
  - Technical investigation and mitigation
  - Coordinates engineers
  - Implements fixes
  - Reports status to IC

Communications Lead (CL):
  - Internal/external status updates
  - Customer communication
  - Stakeholder notifications
  - Status page updates

Incident Workflow

1. Detection (Alert fires or user reports)
   ↓
2. Triage (Assess severity, assign IC)
   ↓
3. Response (Assemble team, create war room)
   ↓
4. Mitigation (Stop the bleeding, restore service)
   ↓
5. Resolution (Fix root cause)
   ↓
6. Postmortem (Blameless review, action items)
   ↓
7. Follow-up (Implement improvements)

On-Call Best Practices

  • Rotation — 1-week shifts, balanced across timezones
  • Escalation — Primary → Secondary → Manager (15 min each); see the sketch after this list
  • Playbooks — Step-by-step debugging guides for common issues
  • Runbooks — Automated remediation scripts
  • Handoff — 15-min sync at rotation change
  • Compensation — On-call pay or comp time
  • Health — No more than 2 incidents/night target
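
To make the escalation timing concrete, a small sketch of the chain above (contacts and the paging mechanism are placeholders; in practice this lives in PagerDuty/Opsgenie):

// Hypothetical escalation chain: Primary -> Secondary -> Manager, stepping every 15 minutes
// while an alert stays unacknowledged.
interface EscalationLevel {
  role: string;
  contact: string; // placeholder: a PagerDuty user, email, or phone number
}

const escalationChain: EscalationLevel[] = [
  { role: 'primary',   contact: 'oncall-primary@example.com' },
  { role: 'secondary', contact: 'oncall-secondary@example.com' },
  { role: 'manager',   contact: 'eng-manager@example.com' },
];

const ESCALATION_INTERVAL_MIN = 15;

// Who should be paged, given how long the alert has gone unacknowledged
function currentEscalationTarget(minutesUnacked: number): EscalationLevel {
  const level = Math.min(
    Math.floor(minutesUnacked / ESCALATION_INTERVAL_MIN),
    escalationChain.length - 1,
  );
  return escalationChain[level];
}

// 20 minutes without acknowledgement: page the secondary on-call
console.log(currentEscalationTarget(20).role); // "secondary"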

Alert Fatigue Prevention

# Symptoms vs Causes alerting
# Alert on WHAT users experience, not WHY it's broken

# GOOD: Symptom-based alert
- alert: APILatencyHigh
  expr: slo:api_latency:p95 > 0.200  # User-facing metric
  annotations:
    summary: "API is slow for users"

# BAD: Cause-based alert
- alert: CPUHigh
  expr: cpu_usage > 70%  # Internal metric, might not impact users
  # Don't alert unless this affects SLOs

# Use SLO-based alerting
# Alert when error budget burn rate is too high

Blameless Postmortems

Core Principles

  • Assume Good Intentions — Everyone did their best with available information
  • Focus on Systems — Identify gaps in process/tooling, not people
  • Psychological Safety — No punishment for honest mistakes
  • Learning Culture — Incidents are opportunities to improve
  • Separate from Performance Reviews — Postmortem participation never affects evaluations

Postmortem Template

# Incident Postmortem: [Title]

**Date:** 2025-01-15
**Duration:** 10:30 - 12:15 UTC (1h 45m)
**Severity:** SEV-2
**Incident Commander:** Jane Doe
**Responders:** John Smith, Alice Johnson

## Impact
- 15,000 users affected
- 12% error rate on payment processing
- $5,000 estimated revenue impact
- No data loss

## Timeline (UTC)
- 10:30 - Alert: Payment error rate > 5%
- 10:32 - IC assigned, war room created
- 10:45 - Identified: Database connection pool exhausted
- 11:00 - Mitigation: Increased pool size from 50 → 100
- 11:15 - Error rate back to normal
- 12:15 - Incident closed after monitoring

## Root Cause
Database connection pool configured for average load, not peak traffic.
Black Friday traffic spike (3x normal) exhausted connections.

## What Went Well
- Alert fired within 2 minutes of issue
- Clear escalation path, IC available immediately
- Mitigation applied quickly (30 minutes to fix)
- No data corruption or loss

## What Went Wrong
- No load testing at 3x scale
- No auto-scaling for connection pool
- No alert on connection pool saturation
- Insufficient monitoring of database metrics

## Action Items
- [ ] (@john) Add connection pool metrics to Grafana (Due: Jan 20)
- [ ] (@alice) Implement auto-scaling based on request rate (Due: Jan 25)
- [ ] (@jane) Add load testing to CI for 5x scale (Due: Feb 1)
- [ ] (@jane) Add alert: connection pool > 80% (Due: Jan 18)
- [ ] (@john) Document connection pool tuning runbook (Due: Jan 22)

## Lessons Learned
1. Black Friday load patterns need dedicated testing
2. Database metrics were missing from standard dashboards
3. Auto-scaling should cover ALL resources, not just pods

Follow-up

  • Review postmortem in team meeting within 1 week
  • Track action items to completion (not optional!)
  • Share learnings across teams
  • Update runbooks and playbooks
  • Celebrate successful incident response

Chaos Engineering

Principles

  1. Define Steady State — Normal system behavior (e.g., 99.9% success rate)
  2. Hypothesize — Predict system will remain stable under failure
  3. Inject Failures — Simulate real-world events
  4. Disprove Hypothesis — Look for deviations from steady state
  5. Learn and Improve — Fix weaknesses, increase resilience

Failure Types

Infrastructure:
  - Pod/node termination
  - Network latency/packet loss
  - DNS failures
  - Cloud region outage

Resources:
  - CPU stress
  - Memory exhaustion
  - Disk I/O saturation
  - File descriptor limits

Dependencies:
  - Database connection failures
  - API timeout/errors
  - Cache unavailability
  - Message queue backlog

Security:
  - DDoS simulation
  - Certificate expiration
  - Unauthorized access attempts

Chaos Mesh Example

# Network latency injection
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay
spec:
  action: delay
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: payment-service
  delay:
    latency: "100ms"
    correlation: "50"
    jitter: "50ms"
  duration: "5m"
  scheduler:
    cron: "@every 2h"  # Run every 2 hours

---
# Pod kill experiment
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill
spec:
  action: pod-kill
  mode: fixed-percent
  value: "10"  # Kill 10% of pods
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-server
  duration: "30s"

Best Practices

  • Start Small — Non-production first, then canary production
  • Collect Baselines — Know normal metrics before experiments
  • Define Success — Clear criteria for what "stable" means
  • Monitor Everything — Watch metrics, logs, traces during tests
  • Automate Rollback — Stop experiment if SLOs violated (see the sketch after this list)
  • Game Days — Scheduled chaos exercises with full team
  • Blameless Reviews — Treat chaos failures like production incidents
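
A minimal sketch of the automated rollback mentioned above: poll the availability SLI (the slo:api_availability:ratio recording rule from the SLO section) and stop the experiment once it falls below target. The Prometheus URL and abortExperiment hook are assumptions:

// SLO guard for chaos experiments: abort when availability drops below the SLO target
const PROMETHEUS_URL = 'http://prometheus:9090';
const SLO_TARGET = 0.999;

async function queryPrometheus(query: string): Promise<number> {
  const res = await fetch(
    `${PROMETHEUS_URL}/api/v1/query?query=${encodeURIComponent(query)}`,
  );
  const body = await res.json();
  // Instant vector result: [{ metric: {...}, value: [timestamp, "0.9995"] }]
  return parseFloat(body.data.result[0]?.value[1] ?? 'NaN');
}

async function sloGuard(abortExperiment: () => Promise<void>) {
  const availability = await queryPrometheus('slo:api_availability:ratio');
  if (Number.isNaN(availability) || availability < SLO_TARGET) {
    console.error(`SLO violated (availability=${availability}); stopping chaos experiment`);
    await abortExperiment(); // e.g. delete the NetworkChaos/PodChaos resource
  }
}

// Check every 30 seconds for the duration of the experiment
setInterval(() => void sloGuard(async () => process.exit(1)), 30_000);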

AIOps and AI in Observability

2025 Trends

  • Anomaly Detection — AI spots unusual patterns in metrics/logs
  • Root Cause Analysis — Correlate failures across services automatically
  • Predictive Alerting — Predict failures before they happen
  • Auto-Remediation — AI suggests or applies fixes autonomously
  • Natural Language Queries — Ask "Why is checkout slow?" instead of writing PromQL
  • AI Observability — Monitor AI model drift, hallucinations, token usage

AI-Driven Platforms (2025)

Dynatrace Davis AI:
  - Auto-detected 73% of incidents before customer impact
  - Reduced alert noise by 90%
  - Causal AI for root cause analysis

Datadog Watchdog:
  - Anomaly detection across metrics, logs, traces
  - Automated correlation of related issues
  - LLM-powered investigation assistant

Elastic AIOps:
  - Machine learning for log anomaly detection
  - Automated baseline learning
  - Predictive alerting

New Relic AI:
  - Natural language query interface
  - Automated incident summarization
  - Proactive capacity recommendations

Implementing AI Observability

# Monitor AI model performance
import time

from openai import AsyncOpenAI
from opentelemetry import trace, metrics

tracer = trace.get_tracer(__name__)
meter = metrics.get_meter(__name__)
openai = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

# Create metrics for AI model
model_latency = meter.create_histogram(
    "ai.model.latency",
    description="AI model inference latency",
    unit="ms"
)
model_tokens = meter.create_counter(
    "ai.model.tokens",
    description="Token usage"
)

async def run_ai_model(prompt: str):
    with tracer.start_as_current_span("ai.inference") as span:
        start = time.time()

        span.set_attribute("ai.model", "gpt-4")
        span.set_attribute("ai.prompt_length", len(prompt))

        response = await openai.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}]
        )

        latency = (time.time() - start) * 1000
        tokens = response.usage.total_tokens

        # Record metrics
        model_latency.record(latency, {"model": "gpt-4"})
        model_tokens.add(tokens, {"model": "gpt-4", "type": "total"})

        # Add to span
        span.set_attribute("ai.response_length", len(response.choices[0].message.content))
        span.set_attribute("ai.tokens_used", tokens)

        return response

Grafana Dashboards

3-3-3 Rule

  • 3 rows of panels per dashboard
  • 3 panels per row
  • 3 key metrics per panel

Avoid "dashboard sprawl" — Each dashboard should answer ONE question.

Dashboard Categories

RED Dashboard (for services):
  - Rate: Requests per second
  - Errors: Error rate
  - Duration: Latency (P50, P95, P99)

USE Dashboard (for resources):
  - Utilization: % of capacity used
  - Saturation: Queue depth, wait time
  - Errors: Error count

Four Golden Signals Dashboard:
  - Latency
  - Traffic
  - Errors
  - Saturation

SLO Dashboard:
  - Current SLI value
  - Error budget remaining
  - Burn rate
  - Trend (30-day)

Panel Best Practices

{
  "title": "API Request Rate",
  "type": "graph",
  "targets": [
    {
      "expr": "sum(rate(http_requests_total[5m])) by (method)",
      "legendFormat": "{{ method }}"
    }
  ],
  "options": {
    "tooltip": { "mode": "multi" },
    "legend": { "displayMode": "table", "calcs": ["mean", "last"] }
  },
  "fieldConfig": {
    "defaults": {
      "unit": "reqps",  // Requests per second
      "color": { "mode": "palette-classic" },
      "custom": {
        "lineWidth": 2,
        "fillOpacity": 10
      }
    }
  }
}

Checklist

## Metrics (Prometheus + Grafana)
- [ ] Layered architecture (app/cluster/global)
- [ ] Recording rules for expensive queries
- [ ] Resource limits and retention configured
- [ ] Dashboards follow 3-3-3 rule
- [ ] Alerts based on SLOs, not internal metrics

## Tracing (OpenTelemetry)
- [ ] Auto-instrumentation enabled
- [ ] Custom spans for business operations
- [ ] Sampling strategy configured
- [ ] Trace context in logs (correlation)
- [ ] Backend connected (Tempo/Jaeger)

## Logging (Loki/ELK)
- [ ] Structured JSON logging
- [ ] Low cardinality labels (<10)
- [ ] Trace IDs in logs
- [ ] Appropriate log levels
- [ ] Retention policy defined

## SLOs
- [ ] SLIs defined for key user journeys
- [ ] SLOs documented and tracked
- [ ] Error budget calculated
- [ ] Burn rate alerting configured
- [ ] Monthly SLO review process

## Incident Response
- [ ] Severity levels defined
- [ ] On-call rotation scheduled
- [ ] Escalation policy documented
- [ ] Runbooks for common issues
- [ ] Postmortem template ready

## Culture
- [ ] Blameless postmortem process
- [ ] Action items tracked to completion
- [ ] Incident learnings shared
- [ ] On-call compensation policy
- [ ] Regular chaos engineering exercises

See Also