| name | observability-sre |
| description | Observability and SRE expert. Use when setting up monitoring, logging, tracing, defining SLOs, or managing incidents. Covers Prometheus, Grafana, OpenTelemetry, and incident response best practices. |
Observability & Site Reliability Engineering
Core Principles
- Three Pillars — Metrics, Logs, and Traces provide holistic visibility
- Observability-First — Build systems that explain their own behavior
- SLO-Driven — Define reliability targets that matter to users
- Proactive Detection — Find issues before customers do
- Blameless Culture — Learn from failures without blame
- Automate Toil — Reduce repetitive operational work
- Continuous Improvement — Each incident makes systems more resilient
- Full-Stack Visibility — Monitor from infrastructure to business metrics
Hard Rules (Must Follow)
These rules are mandatory. Violating them means the skill is not working correctly.
Symptom-Based Alerts Only
Alert on user-facing symptoms, not internal infrastructure metrics.
# ❌ FORBIDDEN: Alerting on internal metrics
- alert: CPUHigh
expr: cpu_usage > 0.70
# Users don't care about CPU, they care about latency
- alert: MemoryHigh
expr: memory_usage > 0.80
# Internal metric, may not affect users
# ✅ REQUIRED: Alert on user experience
- alert: APILatencyHigh
expr: slo:api_latency:p95 > 0.200
annotations:
summary: "Users experiencing slow response times"
- alert: ErrorRateHigh
expr: slo:api_errors:rate5m > 0.001
annotations:
summary: "Users encountering errors"
Low Cardinality Labels
Loki/Prometheus label sets must stay low cardinality: only a handful of labels per stream, each with a small, bounded set of values (never per-request or per-user IDs).
# ❌ FORBIDDEN: High cardinality labels
labels:
user_id: "usr_123" # Millions of values!
order_id: "ord_456" # Millions of values!
request_id: "req_789" # Every request is unique!
# ✅ REQUIRED: Low cardinality only
labels:
namespace: "production" # Few values
app: "api-server" # Few values
level: "error" # 5-6 values
method: "GET" # ~10 values
# High cardinality data goes in log body:
logger.info({
user_id: "usr_123", # In JSON body, not label
order_id: "ord_456",
}, "Order processed");
SLO-Based Error Budgets
Every service must have defined SLOs with error budget tracking.
# ❌ FORBIDDEN: No SLO definition
# Just monitoring without targets
# ✅ REQUIRED: Explicit SLO with budget
# SLO: 99.9% availability
# Error Budget: 0.1% = 43.2 minutes/month downtime
groups:
- name: slo_tracking
rules:
- record: slo:api_availability:ratio
expr: sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
- alert: ErrorBudgetBurnRate
expr: slo:api_availability:ratio < 0.999
for: 5m
annotations:
summary: "Burning error budget too fast"
Trace Context in Logs
All logs must include trace_id for correlation with distributed traces.
// ❌ FORBIDDEN: Logs without trace context
logger.info("Payment processed");
// ✅ REQUIRED: Include trace_id in every log
const span = trace.getActiveSpan();
logger.info({
trace_id: span?.spanContext().traceId,
span_id: span?.spanContext().spanId,
order_id: "ord_123",
}, "Payment processed");
// Output includes correlation:
// {"trace_id":"abc123","span_id":"def456","order_id":"ord_123","msg":"Payment processed"}
Quick Reference
When to Use What
| Scenario | Tool/Pattern | Reason |
|---|---|---|
| Metrics collection | Prometheus + Grafana | Industry standard, powerful query language |
| Distributed tracing | OpenTelemetry + Tempo/Jaeger | Vendor-neutral, CNCF standard |
| Log aggregation (cost-sensitive) | Grafana Loki | Indexes only labels, 10x cheaper |
| Log aggregation (search-heavy) | ELK Stack | Full-text search, advanced analytics |
| Unified observability | Elastic/Datadog/Dynatrace | Single pane of glass for all telemetry |
| Incident management | PagerDuty/Opsgenie | Alert routing, on-call scheduling |
| Chaos engineering | Gremlin/Chaos Mesh | Controlled failure injection |
| AIOps/Anomaly detection | Dynatrace/Datadog | AI-driven root cause analysis |
The Three Pillars
| Pillar | What | When | Tools |
|---|---|---|---|
| Metrics | Numerical time-series data | Real-time monitoring, alerting | Prometheus, StatsD, CloudWatch |
| Logs | Event records with context | Debugging, audit trails | Loki, ELK, Splunk |
| Traces | Request journey across services | Performance analysis, dependencies | OpenTelemetry, Jaeger, Zipkin |
Fourth Pillar (Emerging): Continuous Profiling — Code-level performance data (CPU, memory usage at function level)
Observability Architecture
Layered Prometheus Setup
# 2025 Best Practice: Federated architecture
# Prevents metric chaos while enabling drill-down
# Layer 1: Application Prometheus
# - Detailed business logic metrics
# - High cardinality acceptable
# - Short retention (7 days)
# Layer 2: Cluster Prometheus
# - Per-environment/cluster metrics
# - Medium retention (30 days)
# - Aggregates from application level
# Layer 3: Global Prometheus
# - Cross-cluster critical metrics
# - Long retention (1 year)
# - Federation from cluster level
# Global Prometheus config
scrape_configs:
- job_name: 'federate'
scrape_interval: 15s
honor_labels: true
metrics_path: '/federate'
params:
'match[]':
- '{job="kubernetes-nodes"}'
- '{__name__=~"job:.*"}' # Recording rules only
static_configs:
- targets:
- 'cluster-prom-us-east.internal:9090'
- 'cluster-prom-eu-west.internal:9090'
Recording Rules for Performance
# Precompute expensive queries
groups:
- name: api_performance
interval: 30s
rules:
# Request rate (requests per second)
- record: job:api_requests:rate5m
expr: sum(rate(http_requests_total[5m])) by (job, method, status)
# Error rate
- record: job:api_errors:rate5m
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
/
sum(rate(http_requests_total[5m])) by (job)
# P95 latency
- record: job:api_latency:p95
expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le))
Resource Optimization
# Increase scrape interval for high-target deployments
scrape_interval: 30s # Doubling the default 15s roughly halves scrape load
# Use relabeling to drop unnecessary metrics
metric_relabel_configs:
- source_labels: [__name__]
regex: 'go_.*|process_.*' # Drop Go runtime and process metrics
action: drop
# Limit local retention (vanilla Prometheus sets this via command-line flags, not prometheus.yml)
# --storage.tsdb.retention.time=15d    Keep only 15 days locally
# --storage.tsdb.retention.size=50GB   Or cap at 50GB, whichever is hit first
Distributed Tracing with OpenTelemetry
Auto-Instrumentation Setup
// Node.js auto-instrumentation
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
const sdk = new NodeSDK({
traceExporter: new OTLPTraceExporter({
url: 'http://otel-collector:4318/v1/traces',
}),
instrumentations: [
getNodeAutoInstrumentations({
// Auto-instruments HTTP, Express, PostgreSQL, Redis, etc.
'@opentelemetry/instrumentation-fs': { enabled: false }, // Too noisy
}),
],
});
sdk.start();
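To avoid losing buffered spans when the process exits, it is common to flush the SDK on shutdown. A minimal sketch:
// Flush pending telemetry before the process exits
process.on('SIGTERM', () => {
  sdk.shutdown()
    .then(() => console.log('OpenTelemetry SDK shut down'))
    .catch((err) => console.error('Error shutting down OpenTelemetry SDK', err))
    .finally(() => process.exit(0));
});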
Manual Instrumentation for Business Logic
import { trace, SpanStatusCode } from '@opentelemetry/api';
const tracer = trace.getTracer('payment-service', '1.0.0');
async function processPayment(orderId: string, amount: number) {
// Create custom span for business operation
return tracer.startActiveSpan('processPayment', async (span) => {
try {
// Add business context
span.setAttributes({
'order.id': orderId,
'payment.amount': amount,
'payment.currency': 'USD',
});
// Child span for external API call
const paymentResult = await tracer.startActiveSpan('stripe.charge', async (childSpan) => {
const result = await stripe.charges.create({ amount, currency: 'usd' });
childSpan.setAttribute('stripe.charge_id', result.id);
childSpan.setStatus({ code: SpanStatusCode.OK });
childSpan.end();
return result;
});
span.setStatus({ code: SpanStatusCode.OK });
return paymentResult;
} catch (error) {
span.recordException(error);
span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
throw error;
} finally {
span.end();
}
});
}
Sampling Strategies
# OpenTelemetry Collector config
processors:
# Probabilistic sampling: Keep 10% of traces
probabilistic_sampler:
sampling_percentage: 10
# Tail sampling: Make decisions after seeing full trace
tail_sampling:
policies:
# Always sample errors
- name: error-traces
type: status_code
status_code: {status_codes: [ERROR]}
# Always sample slow requests
- name: slow-traces
type: latency
latency: {threshold_ms: 1000}
# Sample 5% of normal traffic
- name: normal-traces
type: probabilistic
probabilistic: {sampling_percentage: 5}
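These processors only take effect once they are wired into a traces pipeline. A minimal collector layout might look like the sketch below (receiver and exporter endpoints are placeholders):
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
exporters:
  otlp:
    endpoint: tempo.internal:4317   # placeholder tracing backend
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [otlp]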
Context Propagation
// Ensure trace context flows across services
import { propagation, context } from '@opentelemetry/api';
// Outgoing HTTP request (automatic with auto-instrumentation)
fetch('https://api.example.com/data', {
headers: {
// W3C Trace Context headers injected automatically:
// traceparent: 00-<trace-id>-<span-id>-01
// tracestate: vendor=value
},
});
// Manual propagation for non-HTTP (e.g., message queues)
const carrier = {};
propagation.inject(context.active(), carrier);
await publishMessage(queue, { data: payload, headers: carrier });
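On the consumer side, the carrier is extracted back into a context so the consumer's spans join the same trace. A minimal sketch (the message shape and processPayload handler are assumptions):
// Consumer side: continue the trace from the injected carrier
import { propagation, context, trace } from '@opentelemetry/api';

const tracer = trace.getTracer('queue-consumer');

async function handleMessage(message: { data: unknown; headers: Record<string, string> }) {
  // Rebuild the remote context from the carrier headers
  const parentContext = propagation.extract(context.active(), message.headers);
  // Start a span that is a child of the producer's span
  await context.with(parentContext, () =>
    tracer.startActiveSpan('queue.process', async (span) => {
      try {
        await processPayload(message.data); // hypothetical business handler
      } catch (err) {
        span.recordException(err as Error);
        throw err;
      } finally {
        span.end();
      }
    })
  );
}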
Structured Logging Best Practices
JSON Logging Format
// Use structured logging library
import pino from 'pino';
const logger = pino({
level: process.env.LOG_LEVEL || 'info',
formatters: {
level: (label) => ({ level: label }),
},
timestamp: pino.stdTimeFunctions.isoTime,
// Include trace context in logs
mixin() {
const span = trace.getActiveSpan();
if (!span) return {};
const { traceId, spanId } = span.spanContext();
return {
trace_id: traceId,
span_id: spanId,
};
},
});
// Structured logging with context
logger.info(
{
user_id: '123',
order_id: 'ord_456',
amount: 99.99,
payment_method: 'card',
},
'Payment processed successfully'
);
// Output:
// {"level":"info","time":"2025-01-15T10:30:00.000Z","trace_id":"abc123","span_id":"def456","user_id":"123","order_id":"ord_456","amount":99.99,"payment_method":"card","msg":"Payment processed successfully"}
Log Levels
// Follow standard severity levels
logger.trace({ details }, 'Low-level debugging'); // Very verbose
logger.debug({ state }, 'Debug information'); // Development
logger.info({ event }, 'Normal operation'); // Production default
logger.warn({ issue }, 'Warning condition'); // Potential issues
logger.error({ error, context }, 'Error occurred'); // Errors
logger.fatal({ critical }, 'Fatal error'); // Process crash
Grafana Loki Configuration
# Promtail config - ships logs to Loki
server:
http_listen_port: 9080
positions:
filename: /tmp/positions.yaml
clients:
- url: http://loki:3100/loki/api/v1/push
scrape_configs:
- job_name: kubernetes
kubernetes_sd_configs:
- role: pod
relabel_configs:
# Add pod labels as Loki labels (LOW cardinality only!)
- source_labels: [__meta_kubernetes_namespace]
target_label: namespace
- source_labels: [__meta_kubernetes_pod_name]
target_label: pod
- source_labels: [__meta_kubernetes_pod_label_app]
target_label: app
pipeline_stages:
# Parse JSON logs
- json:
expressions:
level: level
trace_id: trace_id
# Promote only low-cardinality fields to labels; trace_id stays in the
# log body/extracted map so it never explodes stream cardinality
- labels:
level:
Loki Best Practices
- Low Cardinality Labels — Use only 5-10 labels (namespace, app, level)
- High Cardinality in Log Body — Put user_id, order_id in JSON, not labels
- LogQL for Filtering — Use {app="api"} | json | user_id="123"
- Retention Policy — Keep recent logs longer, compress old logs
# LogQL query examples
{namespace="production", app="api"} |= "error" # Text search
{app="api"} | json | level="error" | line_format "{{.msg}}" # JSON parsing
rate({app="api"}[5m]) # Log rate per second
sum by (level) (count_over_time({namespace="production"}[1h])) # Count by level
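Because trace IDs live in the log body rather than in labels, trace-to-log correlation is still a one-line query:
{app="api"} | json | trace_id="abc123"                        # All logs for one trace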
SLO/SLI/SLA Management
Definitions
SLI (Service Level Indicator) — Quantifiable measurement of service behavior
- Examples: Request latency, error rate, availability, throughput
SLO (Service Level Objective) — Target value/range for an SLI
- Examples: 99.9% availability, P95 latency < 200ms
SLA (Service Level Agreement) — Formal commitment with consequences
- Examples: "99.9% uptime or 10% credit"
The Four Golden Signals
# Google SRE's key metrics for any service
1. Latency
SLI: P95 request latency
SLO: 95% of requests complete in < 200ms
2. Traffic
SLI: Requests per second
SLO: Handle 10,000 req/s peak load
3. Errors
SLI: Error rate (5xx / total)
SLO: < 0.1% error rate
4. Saturation
SLI: Resource utilization (CPU, memory, disk)
SLO: CPU < 70%, Memory < 80%
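As a rough PromQL sketch (metric names assume the http_requests_total / http_request_duration_seconds instrumentation used elsewhere in this skill, plus node_exporter for saturation):
# Latency: P95 request duration
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# Traffic: requests per second
sum(rate(http_requests_total[5m]))
# Errors: share of 5xx responses
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# Saturation: CPU utilization per node
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))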
Error Budget
# Error budget = 1 - SLO
SLO = 99.9% # "three nines"
Error_Budget = 100% - 99.9% = 0.1%
# Monthly calculation (30 days)
Total_Minutes = 30 * 24 * 60 = 43,200 minutes
Allowed_Downtime = 43,200 * 0.001 = 43.2 minutes
# If you've had 20 minutes downtime this month:
Budget_Remaining = 43.2 - 20 = 23.2 minutes
Budget_Consumed = 20 / 43.2 = 46.3%
# Policy: If budget > 90% consumed, freeze deployments
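Expressed as a PromQL sketch (assuming the slo:api_availability:ratio recording rule defined in the next section), the share of the 30-day budget already consumed is:
# Fraction of the 30-day error budget consumed (1.0 = budget exhausted)
(1 - avg_over_time(slo:api_availability:ratio[30d])) / (1 - 0.999)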
SLO Implementation with Prometheus
# Recording rules for SLI calculation
groups:
- name: slo_availability
interval: 30s
rules:
# Total requests
- record: slo:api_requests:total
expr: sum(rate(http_requests_total[5m]))
# Successful requests (non-5xx)
- record: slo:api_requests:success
expr: sum(rate(http_requests_total{status!~"5.."}[5m]))
# Availability SLI
- record: slo:api_availability:ratio
expr: slo:api_requests:success / slo:api_requests:total
# 30-day availability
- record: slo:api_availability:30d
expr: avg_over_time(slo:api_availability:ratio[30d])
- name: slo_latency
interval: 30s
rules:
# P95 latency SLI
- record: slo:api_latency:p95
expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# Alerting on SLO burn rate
- alert: HighErrorBudgetBurnRate
expr: |
(
slo:api_availability:ratio < 0.999 # Below 99.9% SLO
and
slo:api_availability:30d > 0.999 # But 30-day average still OK
)
for: 5m
annotations:
summary: "Burning error budget too fast"
description: "Current availability {{ $value }} is below SLO. {{ $labels.service }}"
Incident Response
Incident Severity Levels
| Level | Impact | Response Time | Examples |
|---|---|---|---|
| SEV-1 | Service down or major degradation | < 15 min | Complete outage, data loss, security breach |
| SEV-2 | Significant impact, partial outage | < 1 hour | Feature unavailable, high error rates |
| SEV-3 | Minor impact, workaround exists | < 4 hours | Single component degraded, slow performance |
| SEV-4 | Cosmetic, no user impact | Next business day | UI glitches, logging errors |
Incident Response Roles (IMAG Framework)
Incident Commander (IC):
- Overall coordination and decision-making
- Declares incident start/end
- Decides on escalations
- Owns communication to leadership
Operations Lead (OL):
- Technical investigation and mitigation
- Coordinates engineers
- Implements fixes
- Reports status to IC
Communications Lead (CL):
- Internal/external status updates
- Customer communication
- Stakeholder notifications
- Status page updates
Incident Workflow
1. Detection (Alert fires or user reports)
↓
2. Triage (Assess severity, assign IC)
↓
3. Response (Assemble team, create war room)
↓
4. Mitigation (Stop the bleeding, restore service)
↓
5. Resolution (Fix root cause)
↓
6. Postmortem (Blameless review, action items)
↓
7. Follow-up (Implement improvements)
On-Call Best Practices
- Rotation — 1-week shifts, balanced across timezones
- Escalation — Primary → Secondary → Manager (15 min each); see the routing sketch after this list
- Playbooks — Step-by-step debugging guides for common issues
- Runbooks — Automated remediation scripts
- Handoff — 15-min sync at rotation change
- Compensation — On-call pay or comp time
- Health — No more than 2 incidents/night target
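Severity-based routing is commonly handled in Alertmanager before the pager is involved. A minimal sketch (receiver names, routing key, and channel are placeholders):
route:
  receiver: slack-warnings            # default: non-paging alerts
  group_by: [alertname, namespace]
  routes:
    - matchers: ['severity="page"']
      receiver: pagerduty-oncall      # pages the primary on-call
      repeat_interval: 15m            # re-send while the alert keeps firing
receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: <pagerduty-integration-key>
  - name: slack-warnings
    slack_configs:
      - channel: "#alerts"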
Alert Fatigue Prevention
# Symptoms vs Causes alerting
# Alert on WHAT users experience, not WHY it's broken
# GOOD: Symptom-based alert
- alert: APILatencyHigh
expr: slo:api_latency:p95 > 0.200 # User-facing metric
annotations:
summary: "API is slow for users"
# BAD: Cause-based alert
- alert: CPUHigh
expr: cpu_usage > 0.70 # Internal metric, might not impact users
# Don't alert unless this affects SLOs
# Use SLO-based alerting
# Alert when error budget burn rate is too high
Blameless Postmortems
Core Principles
- Assume Good Intentions — Everyone did their best with available information
- Focus on Systems — Identify gaps in process/tooling, not people
- Psychological Safety — No punishment for honest mistakes
- Learning Culture — Incidents are opportunities to improve
- Separate from Performance Reviews — Postmortem participation never affects evaluations
Postmortem Template
# Incident Postmortem: [Title]
**Date:** 2025-01-15
**Duration:** 10:30 - 12:15 UTC (1h 45m)
**Severity:** SEV-2
**Incident Commander:** Jane Doe
**Responders:** John Smith, Alice Johnson
## Impact
- 15,000 users affected
- 12% error rate on payment processing
- $5,000 estimated revenue impact
- No data loss
## Timeline (UTC)
- 10:30 - Alert: Payment error rate > 5%
- 10:32 - IC assigned, war room created
- 10:45 - Identified: Database connection pool exhausted
- 11:00 - Mitigation: Increased pool size from 50 → 100
- 11:15 - Error rate back to normal
- 12:15 - Incident closed after monitoring
## Root Cause
Database connection pool configured for average load, not peak traffic.
Black Friday traffic spike (3x normal) exhausted connections.
## What Went Well
- Alert fired within 2 minutes of issue
- Clear escalation path, IC available immediately
- Mitigation applied quickly (30 minutes to fix)
- No data corruption or loss
## What Went Wrong
- No load testing at 3x scale
- No auto-scaling for connection pool
- No alert on connection pool saturation
- Insufficient monitoring of database metrics
## Action Items
- [ ] (@john) Add connection pool metrics to Grafana (Due: Jan 20)
- [ ] (@alice) Implement auto-scaling based on request rate (Due: Jan 25)
- [ ] (@jane) Add load testing to CI for 5x scale (Due: Feb 1)
- [ ] (@jane) Add alert: connection pool > 80% (Due: Jan 18)
- [ ] (@john) Document connection pool tuning runbook (Due: Jan 22)
## Lessons Learned
1. Black Friday load patterns need dedicated testing
2. Database metrics were missing from standard dashboards
3. Auto-scaling should cover ALL resources, not just pods
Follow-up
- Review postmortem in team meeting within 1 week
- Track action items to completion (not optional!)
- Share learnings across teams
- Update runbooks and playbooks
- Celebrate successful incident response
Chaos Engineering
Principles
- Define Steady State — Normal system behavior (e.g., 99.9% success rate)
- Hypothesize — Predict system will remain stable under failure
- Inject Failures — Simulate real-world events
- Disprove Hypothesis — Look for deviations from steady state
- Learn and Improve — Fix weaknesses, increase resilience
Failure Types
Infrastructure:
- Pod/node termination
- Network latency/packet loss
- DNS failures
- Cloud region outage
Resources:
- CPU stress
- Memory exhaustion
- Disk I/O saturation
- File descriptor limits
Dependencies:
- Database connection failures
- API timeout/errors
- Cache unavailability
- Message queue backlog
Security:
- DDoS simulation
- Certificate expiration
- Unauthorized access attempts
Chaos Mesh Example
# Network latency injection
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
name: network-delay
spec:
action: delay
mode: one
selector:
namespaces:
- production
labelSelectors:
app: payment-service
delay:
latency: "100ms"
correlation: "50"
jitter: "50ms"
duration: "5m"
scheduler:
cron: "@every 2h" # Run every 2 hours
---
# Pod kill experiment
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
name: pod-kill
spec:
action: pod-kill
mode: fixed-percent
value: "10" # Kill 10% of pods
selector:
namespaces:
- production
labelSelectors:
app: api-server
duration: "30s"
Best Practices
- Start Small — Non-production first, then canary production
- Collect Baselines — Know normal metrics before experiments
- Define Success — Clear criteria for what "stable" means
- Monitor Everything — Watch metrics, logs, traces during tests
- Automate Rollback — Stop the experiment automatically if SLOs are violated (see the sketch after this list)
- Game Days — Scheduled chaos exercises with full team
- Blameless Reviews — Treat chaos failures like production incidents
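One way to automate the rollback rule above is a small SLO guard that polls Prometheus during the experiment and deletes the chaos resource if availability drops. A sketch (the Prometheus URL, recording rule, and experiment name are assumptions):
// SLO guard for chaos experiments: abort if availability drops below SLO
import { execFile } from 'node:child_process';

const PROM = 'http://prometheus.internal:9090';       // placeholder
const SLO_QUERY = 'slo:api_availability:ratio';

async function currentAvailability(): Promise<number> {
  const res = await fetch(`${PROM}/api/v1/query?query=${encodeURIComponent(SLO_QUERY)}`);
  const body = await res.json();
  // Prometheus instant query returns [timestamp, "value"] pairs
  return parseFloat(body.data.result[0]?.value[1] ?? '1');
}

async function abortIfSloViolated() {
  if ((await currentAvailability()) < 0.999) {
    // Stop the experiment by deleting the chaos resource
    execFile('kubectl', ['delete', 'networkchaos', 'network-delay', '-n', 'production']);
  }
}

setInterval(abortIfSloViolated, 30_000); // check every 30s while the experiment runs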
AIOps and AI in Observability
2025 Trends
- Anomaly Detection — AI spots unusual patterns in metrics/logs
- Root Cause Analysis — Correlate failures across services automatically
- Predictive Alerting — Predict failures before they happen
- Auto-Remediation — AI suggests or applies fixes autonomously
- Natural Language Queries — Ask "Why is checkout slow?" instead of writing PromQL
- AI Observability — Monitor AI model drift, hallucinations, token usage
AI-Driven Platforms (2025)
Dynatrace Davis AI:
- Auto-detected 73% of incidents before customer impact
- Reduced alert noise by 90%
- Causal AI for root cause analysis
Datadog Watchdog:
- Anomaly detection across metrics, logs, traces
- Automated correlation of related issues
- LLM-powered investigation assistant
Elastic AIOps:
- Machine learning for log anomaly detection
- Automated baseline learning
- Predictive alerting
New Relic AI:
- Natural language query interface
- Automated incident summarization
- Proactive capacity recommendations
Implementing AI Observability
# Monitor AI model performance
import time

from openai import AsyncOpenAI  # assumes the openai>=1.x async client
from opentelemetry import trace, metrics

tracer = trace.get_tracer(__name__)
meter = metrics.get_meter(__name__)
client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
# Create metrics for AI model
model_latency = meter.create_histogram(
"ai.model.latency",
description="AI model inference latency",
unit="ms"
)
model_tokens = meter.create_counter(
"ai.model.tokens",
description="Token usage"
)
async def run_ai_model(prompt: str):
with tracer.start_as_current_span("ai.inference") as span:
start = time.time()
span.set_attribute("ai.model", "gpt-4")
span.set_attribute("ai.prompt_length", len(prompt))
response = await client.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": prompt}]
)
latency = (time.time() - start) * 1000
tokens = response.usage.total_tokens
# Record metrics
model_latency.record(latency, {"model": "gpt-4"})
model_tokens.add(tokens, {"model": "gpt-4", "type": "total"})
# Add to span
span.set_attribute("ai.response_length", len(response.choices[0].message.content))
span.set_attribute("ai.tokens_used", tokens)
return response
Grafana Dashboards
3-3-3 Rule
- 3 rows of panels per dashboard
- 3 panels per row
- 3 key metrics per panel
Avoid "dashboard sprawl" — Each dashboard should answer ONE question.
Dashboard Categories
RED Dashboard (for services):
- Rate: Requests per second
- Errors: Error rate
- Duration: Latency (P50, P95, P99)
USE Dashboard (for resources):
- Utilization: % of capacity used
- Saturation: Queue depth, wait time
- Errors: Error count
Four Golden Signals Dashboard:
- Latency
- Traffic
- Errors
- Saturation
SLO Dashboard:
- Current SLI value
- Error budget remaining
- Burn rate
- Trend (30-day)
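For a RED dashboard, the three panel groups typically map to queries like these (a sketch assuming the http_requests_total / http_request_duration_seconds metrics used earlier):
# Rate: requests per second by method
sum(rate(http_requests_total[5m])) by (method)
# Errors: share of 5xx responses
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# Duration: P50 / P95 / P99 latency
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))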
Panel Best Practices
{
"title": "API Request Rate",
"type": "graph",
"targets": [
{
"expr": "sum(rate(http_requests_total[5m])) by (method)",
"legendFormat": "{{ method }}"
}
],
"options": {
"tooltip": { "mode": "multi" },
"legend": { "displayMode": "table", "calcs": ["mean", "last"] }
},
"fieldConfig": {
"defaults": {
"unit": "reqps", // Requests per second
"color": { "mode": "palette-classic" },
"custom": {
"lineWidth": 2,
"fillOpacity": 10
}
}
}
}
Checklist
## Metrics (Prometheus + Grafana)
- [ ] Layered architecture (app/cluster/global)
- [ ] Recording rules for expensive queries
- [ ] Resource limits and retention configured
- [ ] Dashboards follow 3-3-3 rule
- [ ] Alerts based on SLOs, not internal metrics
## Tracing (OpenTelemetry)
- [ ] Auto-instrumentation enabled
- [ ] Custom spans for business operations
- [ ] Sampling strategy configured
- [ ] Trace context in logs (correlation)
- [ ] Backend connected (Tempo/Jaeger)
## Logging (Loki/ELK)
- [ ] Structured JSON logging
- [ ] Low cardinality labels (<10)
- [ ] Trace IDs in logs
- [ ] Appropriate log levels
- [ ] Retention policy defined
## SLOs
- [ ] SLIs defined for key user journeys
- [ ] SLOs documented and tracked
- [ ] Error budget calculated
- [ ] Burn rate alerting configured
- [ ] Monthly SLO review process
## Incident Response
- [ ] Severity levels defined
- [ ] On-call rotation scheduled
- [ ] Escalation policy documented
- [ ] Runbooks for common issues
- [ ] Postmortem template ready
## Culture
- [ ] Blameless postmortem process
- [ ] Action items tracked to completion
- [ ] Incident learnings shared
- [ ] On-call compensation policy
- [ ] Regular chaos engineering exercises
See Also
- reference/monitoring.md — Prometheus and Grafana deep dive
- reference/logging.md — Structured logging best practices
- reference/tracing.md — OpenTelemetry and distributed tracing
- reference/incident-response.md — Incident management and postmortems
- templates/slo-template.md — SLO definition template