| name | observability-monitoring |
| description | Comprehensive observability and monitoring skill covering Prometheus, Grafana, metrics collection, alerting, exporters, PromQL, and production monitoring patterns for distributed systems and cloud-native applications |
Observability & Monitoring
A comprehensive skill for implementing production-grade observability and monitoring using Prometheus, Grafana, and the wider cloud-native monitoring ecosystem. This skill covers metrics collection, time-series analysis, alerting, visualization, and operational excellence patterns.
When to Use This Skill
Use this skill when:
- Setting up monitoring for production systems and applications
- Implementing metrics collection and observability for microservices
- Creating dashboards and visualizations for system health monitoring
- Defining alerting rules and incident response automation
- Analyzing system performance and capacity using time-series data
- Implementing SLIs, SLOs, and SLAs for service reliability
- Debugging production issues using metrics and traces
- Building custom exporters for application-specific metrics
- Setting up federation for multi-cluster monitoring
- Migrating from legacy monitoring to cloud-native solutions
- Implementing cost monitoring and optimization tracking
- Creating real-time operational dashboards for DevOps teams
Core Concepts
The Four Pillars of Observability
Modern observability is built on four fundamental pillars:
Metrics: Numerical measurements of system behavior over time
- Counter: Monotonically increasing values (requests served, errors)
- Gauge: Point-in-time values that go up and down (memory usage, temperature)
- Histogram: Distribution of values (request duration buckets)
- Summary: Similar to histogram, but quantiles are calculated on the client side
Logs: Discrete events with contextual information
- Structured logging (JSON, key-value pairs)
- Centralized log aggregation (ELK, Loki)
- Correlation with metrics and traces
Traces: Request flow through distributed systems
- Span: Single unit of work with start/end time
- Trace: Collection of spans representing end-to-end request
- OpenTelemetry for distributed tracing
Events: Significant occurrences in system lifecycle
- Deployments, configuration changes
- Scaling events, incidents
- Business events and user actions
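To make the four metric types under Metrics concrete, here is a minimal sketch using the Python prometheus_client library (the metric names, labels, and port are illustrative, not part of any real service):

```python
from prometheus_client import Counter, Gauge, Histogram, Summary, start_http_server
import random
import time

# Counter: monotonically increasing; resets only when the process restarts
REQUESTS = Counter("demo_requests_total", "Requests served", ["status"])

# Gauge: point-in-time value that can go up and down
IN_FLIGHT = Gauge("demo_in_flight_requests", "Requests currently being handled")

# Histogram: server-side buckets; quantiles are computed later in PromQL
LATENCY = Histogram("demo_request_duration_seconds", "Request duration",
                    buckets=[0.05, 0.1, 0.25, 0.5, 1.0])

# Summary: client-side aggregation (the Python client exposes only _sum and _count)
PAYLOAD = Summary("demo_payload_bytes", "Payload size in bytes")

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics on :8000
    while True:
        with IN_FLIGHT.track_inprogress(), LATENCY.time():
            time.sleep(random.uniform(0.01, 0.3))  # simulated work
        REQUESTS.labels(status="200").inc()
        PAYLOAD.observe(random.randint(200, 5000))
```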
Prometheus Architecture
Prometheus is a pull-based monitoring system with key components:
Time-Series Database (TSDB)
- Stores metrics as time-series data
- Efficient compression and retention policies
- Local storage with optional remote storage
Scrape Targets
- Service discovery (Kubernetes, Consul, EC2, etc.)
- Static configuration
- Relabeling for flexible target selection
PromQL Query Engine
- Powerful query language for metrics analysis
- Aggregation, filtering, and mathematical operations
- Range vectors and instant vectors
Alertmanager
- Alert rule evaluation
- Grouping, silencing, and routing
- Integration with PagerDuty, Slack, email, webhooks
Exporters
- Bridge between Prometheus and systems that don't expose Prometheus metrics natively
- Node exporter, cAdvisor, custom exporters
- Third-party exporters for databases, services
Metric Labels and Cardinality
Labels are key-value pairs attached to metrics:
http_requests_total{method="GET", endpoint="/api/users", status="200"}
Label Best Practices:
- Use labels for dimensions you query/aggregate by
- Avoid high-cardinality labels (user IDs, timestamps)
- Keep label names consistent across metrics
- Use relabeling to normalize external labels
Cardinality Considerations:
- Each unique label combination = new time-series
- High cardinality = increased memory and storage
- Monitor cardinality with prometheus_tsdb_symbol_table_size_bytes
- Use recording rules to pre-aggregate high-cardinality metrics
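To watch cardinality in practice, the TSDB status endpoint of the Prometheus HTTP API reports which metrics contribute the most series. A minimal sketch, assuming Prometheus is reachable at http://localhost:9090 and the requests package is available:

```python
import requests

PROM_URL = "http://localhost:9090"  # assumed Prometheus address

# /api/v1/status/tsdb returns head-block statistics, including the metric
# names and label pairs that account for the most series
resp = requests.get(f"{PROM_URL}/api/v1/status/tsdb", timeout=5)
resp.raise_for_status()
stats = resp.json()["data"]

print("Series in head block:", stats["headStats"]["numSeries"])
print("Top metrics by series count:")
for entry in stats["seriesCountByMetricName"][:10]:
    print(f"  {entry['name']}: {entry['value']}")
```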
Recording Rules
Pre-compute frequently-used or expensive queries:
groups:
- name: api_performance
interval: 30s
rules:
- record: api:http_request_duration_seconds:p99
expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
- record: api:http_requests:rate5m
expr: rate(http_requests_total[5m])
Benefits:
- Faster dashboard loading
- Reduced query load on Prometheus
- Consistent metric naming conventions
- Enable complex aggregations
Service Level Objectives (SLOs)
Define and track reliability targets:
SLI (Service Level Indicator): Metric measuring service quality
- Availability: % of successful requests
- Latency: % of requests under threshold
- Throughput: Requests per second
SLO (Service Level Objective): Target for SLI
- 99.9% availability (43.8 minutes downtime/month)
- 95% of requests < 200ms
- 1000 RPS sustained
SLA (Service Level Agreement): Contract with consequences
- External commitments to customers
- Financial penalties for SLO violations
Error Budget: Acceptable failure rate
- Error budget = 100% - SLO
- 99.9% SLO = 0.1% error budget
- Use budget for innovation vs. reliability tradeoff
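The error-budget arithmetic is simple enough to automate. A small sketch (the SLO values are examples) that converts an SLO target into allowed downtime per 30-day window:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for a given SLO over the window."""
    error_budget = 1.0 - slo  # e.g. 99.9% SLO -> 0.1% budget
    return window_days * 24 * 60 * error_budget

for slo in (0.99, 0.999, 0.9999):
    print(f"{slo:.2%} SLO -> {error_budget_minutes(slo):.1f} minutes of error budget per 30 days")
# 99.00% -> 432.0 minutes, 99.90% -> 43.2 minutes, 99.99% -> 4.3 minutes
```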
Prometheus Setup and Configuration
Basic Prometheus Configuration
# prometheus.yml
global:
scrape_interval: 15s # Default scrape interval
evaluation_interval: 15s # Alert rule evaluation interval
external_labels:
cluster: 'production'
region: 'us-west-2'
# Alertmanager configuration
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager:9093
# Load rules
rule_files:
- 'rules/*.yml'
- 'alerts/*.yml'
# Scrape configurations
scrape_configs:
# Prometheus self-monitoring
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
# Node exporter for system metrics
- job_name: 'node'
static_configs:
- targets:
- 'node1:9100'
- 'node2:9100'
- 'node3:9100'
relabel_configs:
- source_labels: [__address__]
target_label: instance
regex: '([^:]+):.*'
replacement: '${1}'
# Application metrics
- job_name: 'api'
static_configs:
- targets: ['api-1:8080', 'api-2:8080', 'api-3:8080']
labels:
env: 'production'
tier: 'backend'
Kubernetes Service Discovery
scrape_configs:
# Kubernetes API server
- job_name: 'kubernetes-apiservers'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: default;kubernetes;https
# Kubernetes pods with prometheus.io annotations
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
# Only scrape pods with prometheus.io/scrape: "true" annotation
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
# Use the port from prometheus.io/port annotation
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
# Use the path from prometheus.io/path annotation
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
# Add namespace label
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
# Add pod name label
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: kubernetes_pod_name
# Kubernetes services
- job_name: 'kubernetes-services'
kubernetes_sd_configs:
- role: service
metrics_path: /probe
params:
module: [http_2xx]
relabel_configs:
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe]
action: keep
regex: true
- source_labels: [__address__]
target_label: __param_target
- target_label: __address__
replacement: blackbox-exporter:9115
- source_labels: [__param_target]
target_label: instance
Storage and Retention
# Storage path and retention are set via command-line flags, not in prometheus.yml:
#   --storage.tsdb.path=/prometheus/data
#   --storage.tsdb.retention.time=15d
#   --storage.tsdb.retention.size=50GB
# Remote write for long-term storage
remote_write:
- url: "https://prometheus-remote-storage.example.com/api/v1/write"
basic_auth:
username: prometheus
password_file: /etc/prometheus/remote_storage_password
queue_config:
capacity: 10000
max_shards: 50
max_samples_per_send: 5000
write_relabel_configs:
# Drop high-cardinality metrics
- source_labels: [__name__]
regex: 'container_network_.*'
action: drop
# Remote read for querying historical data
remote_read:
- url: "https://prometheus-remote-storage.example.com/api/v1/read"
basic_auth:
username: prometheus
password_file: /etc/prometheus/remote_storage_password
read_recent: true
PromQL: The Prometheus Query Language
Instant Vectors and Selectors
# Basic metric selection
http_requests_total
# Filter by label
http_requests_total{job="api", status="200"}
# Regex matching
http_requests_total{status=~"2..|3.."}
# Negative matching
http_requests_total{status!="500"}
# Multiple label matchers
http_requests_total{job="api", method="GET", status=~"2.."}
Range Vectors and Aggregations
# 5-minute range vector
http_requests_total[5m]
# Rate of increase per second
rate(http_requests_total[5m])
# Increase over time window
increase(http_requests_total[1h])
# Average over time
avg_over_time(cpu_usage[5m])
# Max/Min over time
max_over_time(response_time_seconds[10m])
min_over_time(response_time_seconds[10m])
# Standard deviation
stddev_over_time(response_time_seconds[5m])
Aggregation Operators
# Sum across all instances
sum(rate(http_requests_total[5m]))
# Sum grouped by job
sum by (job) (rate(http_requests_total[5m]))
# Average grouped by multiple labels
avg by (job, instance) (cpu_usage)
# Count number of series
count(up == 1)
# Topk and bottomk
topk(5, rate(http_requests_total[5m]))
bottomk(3, node_memory_MemAvailable_bytes)
# Quantile across instances
quantile(0.95, http_request_duration_seconds)
Mathematical Operations
# Arithmetic operations
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
# Comparison operators
http_request_duration_seconds > 0.5
# Logical operators
up == 1 and rate(http_requests_total[5m]) > 100
# Vector matching
rate(http_requests_total[5m]) / on(instance) group_left rate(http_responses_total[5m])
Advanced PromQL Patterns
# Request success rate
sum(rate(http_requests_total{status=~"2.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100
# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100
# Latency percentiles (histogram)
histogram_quantile(0.99,
sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)
# Predict linear growth
predict_linear(node_filesystem_free_bytes[1h], 4 * 3600)
# Detect anomalies with standard deviation
abs(cpu_usage - avg_over_time(cpu_usage[1h]))
>
3 * stddev_over_time(cpu_usage[1h])
# CPU utilization per instance (USE method)
sum(rate(node_cpu_seconds_total{mode!="idle"}[5m])) by (instance)
/
count(node_cpu_seconds_total{mode="idle"}) by (instance)
# Burn rate for SLO
(
sum(rate(http_requests_total{status=~"5.."}[1h]))
/
sum(rate(http_requests_total[1h]))
)
>
(14.4 * (1 - 0.999)) # For 99.9% SLO
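Any of these expressions can also be evaluated programmatically through the Prometheus HTTP API, which is handy for reports and ad-hoc checks. A sketch using the success-rate query above, assuming Prometheus at http://localhost:9090:

```python
import requests

PROM_URL = "http://localhost:9090"  # assumed Prometheus address

# Request success rate, evaluated as an instant query
query = """
sum(rate(http_requests_total{status=~"2.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100
"""

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}, timeout=10)
resp.raise_for_status()
result = resp.json()["data"]["result"]

if result:
    _timestamp, value = result[0]["value"]
    print(f"Current success rate: {float(value):.2f}%")
else:
    print("Query returned no series (is http_requests_total being scraped?)")
```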
Alerting with Prometheus and Alertmanager
Alert Rule Definitions
# alerts/api_alerts.yml
groups:
- name: api_alerts
interval: 30s
rules:
# High error rate alert
- alert: HighErrorRate
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
/
sum(rate(http_requests_total[5m])) by (service)
> 0.05
for: 5m
labels:
severity: critical
team: backend
annotations:
summary: "High error rate on {{ $labels.service }}"
description: "Error rate is {{ $value | humanizePercentage }} on {{ $labels.service }}"
runbook_url: "https://runbooks.example.com/HighErrorRate"
# High latency alert
- alert: HighLatency
expr: |
histogram_quantile(0.99,
sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
) > 1
for: 10m
labels:
severity: warning
team: backend
annotations:
summary: "High latency on {{ $labels.service }}"
description: "P99 latency is {{ $value }}s on {{ $labels.service }}"
# Service down alert
- alert: ServiceDown
expr: up{job="api"} == 0
for: 2m
labels:
severity: critical
team: sre
annotations:
summary: "Service {{ $labels.instance }} is down"
description: "{{ $labels.job }} on {{ $labels.instance }} has been down for more than 2 minutes"
# Disk space alert
- alert: DiskSpaceLow
expr: |
(node_filesystem_avail_bytes{mountpoint="/"}
/
node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10
for: 5m
labels:
severity: warning
team: sre
annotations:
summary: "Low disk space on {{ $labels.instance }}"
description: "Disk space is {{ $value | humanize }}% on {{ $labels.instance }}"
# Memory pressure alert
- alert: HighMemoryUsage
expr: |
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
for: 10m
labels:
severity: warning
team: sre
annotations:
summary: "High memory usage on {{ $labels.instance }}"
description: "Memory usage is {{ $value | humanize }}% on {{ $labels.instance }}"
# CPU saturation alert
- alert: HighCPUUsage
expr: |
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 15m
labels:
severity: warning
team: sre
annotations:
summary: "High CPU usage on {{ $labels.instance }}"
description: "CPU usage is {{ $value | humanize }}% on {{ $labels.instance }}"
Alertmanager Configuration
# alertmanager.yml
global:
resolve_timeout: 5m
slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'
# Templates for notifications
templates:
- '/etc/alertmanager/templates/*.tmpl'
# Route tree for alert distribution
route:
group_by: ['alertname', 'cluster', 'service']
group_wait: 10s
group_interval: 10s
repeat_interval: 12h
receiver: 'team-default'
routes:
# Critical alerts go to PagerDuty
- match:
severity: critical
receiver: 'pagerduty-critical'
continue: true
# Critical alerts also go to Slack
- match:
severity: critical
receiver: 'slack-critical'
group_wait: 0s
# Warning alerts to Slack only
- match:
severity: warning
receiver: 'slack-warnings'
# Team-specific routing
- match:
team: backend
receiver: 'team-backend'
- match:
team: frontend
receiver: 'team-frontend'
# Database alerts to DBA team
- match_re:
service: 'postgres|mysql|mongodb'
receiver: 'team-dba'
# Alert receivers/integrations
receivers:
- name: 'team-default'
slack_configs:
- channel: '#alerts'
title: 'Alert: {{ .GroupLabels.alertname }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
send_resolved: true
- name: 'pagerduty-critical'
pagerduty_configs:
- routing_key: 'YOUR_PAGERDUTY_ROUTING_KEY'
description: '{{ .GroupLabels.alertname }}: {{ .GroupLabels.service }}'
severity: '{{ .CommonLabels.severity }}'
- name: 'slack-critical'
slack_configs:
- channel: '#incidents'
title: 'CRITICAL: {{ .GroupLabels.alertname }}'
text: '{{ range .Alerts }}{{ .Annotations.summary }}\n{{ .Annotations.description }}{{ end }}'
color: 'danger'
send_resolved: true
- name: 'slack-warnings'
slack_configs:
- channel: '#monitoring'
title: 'Warning: {{ .GroupLabels.alertname }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
color: 'warning'
send_resolved: true
- name: 'team-backend'
slack_configs:
- channel: '#team-backend'
send_resolved: true
email_configs:
- to: 'backend-team@example.com'
from: 'alertmanager@example.com'
smarthost: 'smtp.gmail.com:587'
auth_username: 'alertmanager@example.com'
auth_password_file: '/etc/alertmanager/email_password'
- name: 'team-frontend'
slack_configs:
- channel: '#team-frontend'
send_resolved: true
- name: 'team-dba'
slack_configs:
- channel: '#team-dba'
send_resolved: true
pagerduty_configs:
- routing_key: 'DBA_PAGERDUTY_KEY'
# Inhibition rules (suppress alerts)
inhibit_rules:
# Inhibit warnings if critical alert is firing
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
equal: ['alertname', 'instance']
# Don't alert on instance down if cluster is down
- source_match:
alertname: 'ClusterDown'
target_match_re:
alertname: 'InstanceDown|ServiceDown'
equal: ['cluster']
Multi-Window Multi-Burn-Rate Alerts for SLOs
# SLO-based alerting using burn rate
groups:
- name: slo_alerts
interval: 30s
rules:
# Fast burn (1h window, 5m burn)
- alert: ErrorBudgetBurnFast
expr: |
(
sum(rate(http_requests_total{status=~"5.."}[1h]))
/
sum(rate(http_requests_total[1h]))
) > (14.4 * (1 - 0.999))
and
(
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
) > (14.4 * (1 - 0.999))
for: 2m
labels:
severity: critical
slo: "99.9%"
annotations:
summary: "Fast error budget burn detected"
description: "Error rate is burning through 99.9% SLO budget 14.4x faster than normal"
# Slow burn (6h window, 30m burn)
- alert: ErrorBudgetBurnSlow
expr: |
(
sum(rate(http_requests_total{status=~"5.."}[6h]))
/
sum(rate(http_requests_total[6h]))
) > (6 * (1 - 0.999))
and
(
sum(rate(http_requests_total{status=~"5.."}[30m]))
/
sum(rate(http_requests_total[30m]))
) > (6 * (1 - 0.999))
for: 15m
labels:
severity: warning
slo: "99.9%"
annotations:
summary: "Slow error budget burn detected"
description: "Error rate is burning through 99.9% SLO budget 6x faster than normal"
Grafana Dashboards and Visualization
Dashboard JSON Structure
{
"dashboard": {
"title": "API Performance Dashboard",
"tags": ["api", "performance", "production"],
"timezone": "browser",
"editable": true,
"graphTooltip": 1,
"time": {
"from": "now-6h",
"to": "now"
},
"timepicker": {
"refresh_intervals": ["5s", "10s", "30s", "1m", "5m", "15m"],
"time_options": ["5m", "15m", "1h", "6h", "12h", "24h", "7d"]
},
"templating": {
"list": [
{
"name": "cluster",
"type": "query",
"datasource": "Prometheus",
"query": "label_values(up, cluster)",
"refresh": 1,
"multi": false,
"includeAll": false
},
{
"name": "service",
"type": "query",
"datasource": "Prometheus",
"query": "label_values(up{cluster=\"$cluster\"}, service)",
"refresh": 1,
"multi": true,
"includeAll": true
},
{
"name": "interval",
"type": "interval",
"query": "1m,5m,10m,30m,1h",
"auto": true,
"auto_count": 30,
"auto_min": "10s"
}
]
},
"panels": [
{
"id": 1,
"title": "Request Rate",
"type": "graph",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
"targets": [
{
"expr": "sum(rate(http_requests_total{service=~\"$service\"}[$interval])) by (service)",
"legendFormat": "{{ service }}",
"refId": "A"
}
],
"yaxes": [
{"format": "reqps", "label": "Requests/sec"},
{"format": "short"}
],
"legend": {
"show": true,
"values": true,
"current": true,
"avg": true,
"max": true
}
},
{
"id": 2,
"title": "Error Rate",
"type": "graph",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
"targets": [
{
"expr": "sum(rate(http_requests_total{service=~\"$service\",status=~\"5..\"}[$interval])) by (service) / sum(rate(http_requests_total{service=~\"$service\"}[$interval])) by (service) * 100",
"legendFormat": "{{ service }} error %",
"refId": "A"
}
],
"yaxes": [
{"format": "percent", "label": "Error Rate"},
{"format": "short"}
],
"alert": {
"conditions": [
{
"evaluator": {"params": [5], "type": "gt"},
"operator": {"type": "and"},
"query": {"params": ["A", "5m", "now"]},
"reducer": {"params": [], "type": "avg"},
"type": "query"
}
],
"executionErrorState": "alerting",
"frequency": "1m",
"handler": 1,
"name": "High Error Rate",
"noDataState": "no_data",
"notifications": []
}
},
{
"id": 3,
"title": "Latency Percentiles",
"type": "graph",
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 8},
"targets": [
{
"expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service=~\"$service\"}[$interval])) by (service, le))",
"legendFormat": "{{ service }} p99",
"refId": "A"
},
{
"expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service=~\"$service\"}[$interval])) by (service, le))",
"legendFormat": "{{ service }} p95",
"refId": "B"
},
{
"expr": "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{service=~\"$service\"}[$interval])) by (service, le))",
"legendFormat": "{{ service }} p50",
"refId": "C"
}
],
"yaxes": [
{"format": "s", "label": "Duration"},
{"format": "short"}
]
}
]
}
}
RED Method Dashboard
The RED method focuses on Request rate, Error rate, and Duration:
{
"panels": [
{
"title": "Request Rate (per service)",
"targets": [
{
"expr": "sum(rate(http_requests_total[$__rate_interval])) by (service)"
}
]
},
{
"title": "Error Rate % (per service)",
"targets": [
{
"expr": "sum(rate(http_requests_total{status=~\"5..\"}[$__rate_interval])) by (service) / sum(rate(http_requests_total[$__rate_interval])) by (service) * 100"
}
]
},
{
"title": "Duration p99 (per service)",
"targets": [
{
"expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[$__rate_interval])) by (service, le))"
}
]
}
]
}
USE Method Dashboard
The USE method monitors Utilization, Saturation, and Errors:
{
"panels": [
{
"title": "CPU Utilization %",
"targets": [
{
"expr": "100 - (avg by (instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[$__rate_interval])) * 100)"
}
]
},
{
"title": "CPU Saturation (Load Average)",
"targets": [
{
"expr": "node_load1"
}
]
},
{
"title": "Memory Utilization %",
"targets": [
{
"expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100"
}
]
},
{
"title": "Disk I/O Utilization %",
"targets": [
{
"expr": "rate(node_disk_io_time_seconds_total[$__rate_interval]) * 100"
}
]
},
{
"title": "Network Errors",
"targets": [
{
"expr": "rate(node_network_receive_errs_total[$__rate_interval]) + rate(node_network_transmit_errs_total[$__rate_interval])"
}
]
}
]
}
Exporters and Metric Collection
Node Exporter for System Metrics
# Install node_exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xvfz node_exporter-1.6.1.linux-amd64.tar.gz
cd node_exporter-1.6.1.linux-amd64
./node_exporter --web.listen-address=":9100" \
--collector.filesystem.mount-points-exclude="^/(dev|proc|sys|var/lib/docker/.+)($|/)" \
--collector.netclass.ignored-devices="^(veth.*|br.*|docker.*|lo)$"
Key Metrics from Node Exporter:
- node_cpu_seconds_total: CPU usage by mode
- node_memory_MemTotal_bytes, node_memory_MemAvailable_bytes: Memory
- node_disk_io_time_seconds_total: Disk I/O
- node_network_receive_bytes_total, node_network_transmit_bytes_total: Network
- node_filesystem_size_bytes, node_filesystem_avail_bytes: Disk space
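If you want to verify what an exporter exposes without going through Prometheus, the exposition format can be scraped and parsed directly. A sketch using the prometheus_client parser against a node_exporter assumed to listen on localhost:9100:

```python
import requests
from prometheus_client.parser import text_string_to_metric_families

raw = requests.get("http://localhost:9100/metrics", timeout=5).text  # assumed address

wanted = {"node_memory_MemTotal_bytes", "node_memory_MemAvailable_bytes"}
values = {}
for family in text_string_to_metric_families(raw):
    if family.name in wanted:
        values[family.name] = family.samples[0].value  # these gauges have a single sample

used_pct = 100 * (1 - values["node_memory_MemAvailable_bytes"]
                  / values["node_memory_MemTotal_bytes"])
print(f"Memory utilization: {used_pct:.1f}%")
```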
Custom Application Exporter (Python)
# app_exporter.py
from prometheus_client import start_http_server, Counter, Gauge, Histogram, Summary
import time
import random
# Define metrics
REQUEST_COUNT = Counter(
'app_requests_total',
'Total app requests',
['method', 'endpoint', 'status']
)
REQUEST_DURATION = Histogram(
'app_request_duration_seconds',
'Request duration in seconds',
['method', 'endpoint'],
buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 2.5, 5.0, 10.0]
)
ACTIVE_USERS = Gauge(
'app_active_users',
'Number of active users'
)
QUEUE_SIZE = Gauge(
'app_queue_size',
'Current queue size',
['queue_name']
)
DATABASE_CONNECTIONS = Gauge(
'app_database_connections',
'Number of database connections',
['pool', 'state']
)
CACHE_HITS = Counter(
'app_cache_hits_total',
'Total cache hits',
['cache_name']
)
CACHE_MISSES = Counter(
'app_cache_misses_total',
'Total cache misses',
['cache_name']
)
def simulate_metrics():
"""Simulate application metrics"""
while True:
# Simulate requests
method = random.choice(['GET', 'POST', 'PUT', 'DELETE'])
endpoint = random.choice(['/api/users', '/api/products', '/api/orders'])
status = random.choice(['200', '200', '200', '400', '500'])
REQUEST_COUNT.labels(method=method, endpoint=endpoint, status=status).inc()
# Simulate request duration
duration = random.uniform(0.01, 2.0)
REQUEST_DURATION.labels(method=method, endpoint=endpoint).observe(duration)
# Update gauges
ACTIVE_USERS.set(random.randint(100, 1000))
QUEUE_SIZE.labels(queue_name='jobs').set(random.randint(0, 50))
QUEUE_SIZE.labels(queue_name='emails').set(random.randint(0, 20))
# Database connection pool
DATABASE_CONNECTIONS.labels(pool='main', state='active').set(random.randint(5, 20))
DATABASE_CONNECTIONS.labels(pool='main', state='idle').set(random.randint(10, 30))
# Cache metrics
if random.random() > 0.3:
CACHE_HITS.labels(cache_name='redis').inc()
else:
CACHE_MISSES.labels(cache_name='redis').inc()
time.sleep(1)
if __name__ == '__main__':
# Start metrics server on port 8000
start_http_server(8000)
print("Metrics server started on port 8000")
simulate_metrics()
Custom Exporter (Go)
package main
import (
"log"
"math/rand"
"net/http"
"time"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promhttp"
)
var (
requestsTotal = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "app_requests_total",
Help: "Total number of requests",
},
[]string{"method", "endpoint", "status"},
)
requestDuration = prometheus.NewHistogramVec(
prometheus.HistogramOpts{
Name: "app_request_duration_seconds",
Help: "Request duration in seconds",
Buckets: prometheus.ExponentialBuckets(0.01, 2, 10),
},
[]string{"method", "endpoint"},
)
activeUsers = prometheus.NewGauge(
prometheus.GaugeOpts{
Name: "app_active_users",
Help: "Number of active users",
},
)
databaseConnections = prometheus.NewGaugeVec(
prometheus.GaugeOpts{
Name: "app_database_connections",
Help: "Number of database connections",
},
[]string{"pool", "state"},
)
)
func init() {
prometheus.MustRegister(requestsTotal)
prometheus.MustRegister(requestDuration)
prometheus.MustRegister(activeUsers)
prometheus.MustRegister(databaseConnections)
}
func simulateMetrics() {
ticker := time.NewTicker(1 * time.Second)
defer ticker.Stop()
for range ticker.C {
// Simulate requests
methods := []string{"GET", "POST", "PUT", "DELETE"}
endpoints := []string{"/api/users", "/api/products", "/api/orders"}
statuses := []string{"200", "200", "200", "400", "500"}
method := methods[rand.Intn(len(methods))]
endpoint := endpoints[rand.Intn(len(endpoints))]
status := statuses[rand.Intn(len(statuses))]
requestsTotal.WithLabelValues(method, endpoint, status).Inc()
requestDuration.WithLabelValues(method, endpoint).Observe(rand.Float64() * 2)
// Update gauges
activeUsers.Set(float64(rand.Intn(900) + 100))
databaseConnections.WithLabelValues("main", "active").Set(float64(rand.Intn(15) + 5))
databaseConnections.WithLabelValues("main", "idle").Set(float64(rand.Intn(20) + 10))
}
}
func main() {
go simulateMetrics()
http.Handle("/metrics", promhttp.Handler())
log.Println("Starting metrics server on :8000")
log.Fatal(http.ListenAndServe(":8000", nil))
}
PostgreSQL Exporter
# docker-compose.yml for postgres_exporter
version: '3.8'
services:
postgres-exporter:
image: prometheuscommunity/postgres-exporter
environment:
DATA_SOURCE_NAME: "postgresql://user:password@postgres:5432/dbname?sslmode=disable"
ports:
- "9187:9187"
command:
- '--collector.stat_statements'
- '--collector.stat_database'
- '--collector.replication'
Key PostgreSQL Metrics:
- pg_up: Database reachability
- pg_stat_database_tup_returned: Rows read
- pg_stat_database_tup_inserted: Rows inserted
- pg_stat_database_deadlocks: Deadlock count
- pg_stat_replication_lag: Replication lag in seconds
- pg_locks_count: Active locks
Blackbox Exporter for Probing
# blackbox.yml
modules:
http_2xx:
prober: http
timeout: 5s
http:
valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
valid_status_codes: [200]
method: GET
preferred_ip_protocol: "ip4"
http_post_json:
prober: http
http:
method: POST
headers:
Content-Type: application/json
body: '{"key":"value"}'
valid_status_codes: [200, 201]
tcp_connect:
prober: tcp
timeout: 5s
icmp:
prober: icmp
timeout: 5s
icmp:
preferred_ip_protocol: "ip4"
# Prometheus config for blackbox exporter
scrape_configs:
- job_name: 'blackbox-http'
metrics_path: /probe
params:
module: [http_2xx]
static_configs:
- targets:
- https://example.com
- https://api.example.com/health
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: blackbox-exporter:9115
Best Practices
Metric Naming Conventions
Follow Prometheus naming best practices:
# Format: <namespace>_<subsystem>_<metric>_<unit>
# Good examples
http_requests_total # Counter
http_request_duration_seconds # Histogram
database_connections_active # Gauge
cache_hits_total # Counter
memory_usage_bytes # Gauge
# Include unit suffixes
_seconds, _bytes, _total, _ratio (prefer 0-1 ratios over percentages)
# Avoid
RequestCount # Use snake_case
http_requests # Missing _total for counter
request_time # Missing unit (should be _seconds)
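A small lint helper can catch most of these naming issues in code review or CI. The regex is the metric-name pattern from the Prometheus data model; the suffix check is a heuristic, not an official rule:

```python
import re

# Metric names must match this pattern per the Prometheus data model
METRIC_NAME_RE = re.compile(r"^[a-zA-Z_:][a-zA-Z0-9_:]*$")
UNIT_SUFFIXES = ("_total", "_seconds", "_bytes", "_ratio", "_info")

def lint_metric_name(name: str) -> list[str]:
    problems = []
    if not METRIC_NAME_RE.match(name):
        problems.append("contains characters outside [a-zA-Z0-9_:]")
    if name != name.lower():
        problems.append("use snake_case, not CamelCase")
    if not name.endswith(UNIT_SUFFIXES):
        problems.append("consider a unit suffix (_seconds, _bytes, _total, ...)")
    return problems

for candidate in ("RequestCount", "http_requests_total", "request_time"):
    print(candidate, "->", lint_metric_name(candidate) or "ok")
```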
Label Guidelines
# Good: Low cardinality labels
http_requests_total{method="GET", endpoint="/api/users", status="200"}
# Bad: High cardinality labels (avoid)
http_requests_total{user_id="12345", session_id="abc-def-ghi"}
# Good: Use bounded label values
http_requests_total{status_class="2xx"}
# Bad: Unbounded label values
http_requests_total{response_size="1234567"}
Recording Rule Patterns
groups:
- name: performance_rules
interval: 30s
rules:
# Pre-aggregate expensive queries
- record: job:http_requests:rate5m
expr: sum(rate(http_requests_total[5m])) by (job)
# Namespace aggregations
- record: namespace:http_requests:rate5m
expr: sum(rate(http_requests_total[5m])) by (namespace)
# SLI calculations
- record: job:http_requests_success:rate5m
expr: sum(rate(http_requests_total{status=~"2.."}[5m])) by (job)
- record: job:http_requests_error_rate:ratio
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
/
sum(rate(http_requests_total[5m])) by (job)
Alert Design Principles
- Alert on symptoms, not causes: Alert on user-facing issues
- Make alerts actionable: Include runbook links
- Use appropriate severity levels: Critical, warning, info
- Set proper thresholds: Based on historical data
- Include context in annotations: Help on-call engineers
- Group related alerts: Reduce alert fatigue
- Use inhibition rules: Suppress redundant alerts
- Test alert rules: Verify they fire when expected
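Beyond promtool's rule unit tests, one practical way to exercise routing, grouping, and receivers end to end is to push a synthetic alert into Alertmanager's v2 API and confirm it reaches the expected channel. A sketch, assuming Alertmanager listens on localhost:9093 and the label values match the routing tree shown earlier:

```python
import datetime
import requests

ALERTMANAGER = "http://localhost:9093"  # assumed address

now = datetime.datetime.now(datetime.timezone.utc)
test_alert = [{
    "labels": {
        "alertname": "SyntheticRoutingTest",
        "severity": "warning",   # should match the slack-warnings route
        "team": "backend",
        "service": "api",
    },
    "annotations": {
        "summary": "Synthetic alert to verify Alertmanager routing",
        "description": "Safe to ignore; sent by the routing test script.",
    },
    "startsAt": now.isoformat(),
    "endsAt": (now + datetime.timedelta(minutes=5)).isoformat(),
}]

resp = requests.post(f"{ALERTMANAGER}/api/v2/alerts", json=test_alert, timeout=5)
resp.raise_for_status()
print("Alert accepted; check the receiver configured for the warning route.")
```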
Dashboard Best Practices
- One dashboard per audience: SRE, developers, business
- Use consistent time ranges: Make comparisons easier
- Include SLI/SLO metrics: Show business impact
- Add annotations for deploys: Correlate changes with metrics
- Use template variables: Make dashboards reusable
- Show trends and aggregates: Not just raw metrics
- Include links to runbooks: Enable quick response
- Use appropriate visualizations: Graphs, gauges, tables
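To keep dashboards reusable and reviewable, many teams store the JSON in version control and push it through the Grafana HTTP API rather than editing in the UI. A sketch, assuming Grafana at http://localhost:3000, a service-account token, and the dashboard JSON from the earlier example saved to a file:

```python
import json
import requests

GRAFANA_URL = "http://localhost:3000"              # assumed Grafana address
API_TOKEN = "REPLACE_WITH_SERVICE_ACCOUNT_TOKEN"   # assumed token

with open("api_performance_dashboard.json") as f:  # dashboard JSON as shown above
    dashboard = json.load(f)["dashboard"]

payload = {
    "dashboard": dashboard,
    "overwrite": True,  # update in place so Git stays the source of truth
}
resp = requests.post(
    f"{GRAFANA_URL}/api/dashboards/db",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json=payload,
    timeout=10,
)
resp.raise_for_status()
print("Dashboard provisioned:", resp.json().get("url"))
```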
High Availability Setup
# Prometheus HA with Thanos
# Deploy multiple Prometheus instances with same config
# Use Thanos to deduplicate and provide global view
# prometheus-1.yml
global:
external_labels:
cluster: 'prod'
replica: '1'
# prometheus-2.yml
global:
external_labels:
cluster: 'prod'
replica: '2'
# Thanos sidecar configuration
# Uploads blocks to object storage
# Provides StoreAPI for querying
Capacity Planning Queries
# Disk space exhaustion prediction
predict_linear(node_filesystem_free_bytes[1h], 4 * 3600) < 0
# Memory growth trend
predict_linear(node_memory_MemAvailable_bytes[1h], 24 * 3600)
# Request rate growth
predict_linear(sum(rate(http_requests_total[1h]))[24h:1h], 7 * 24 * 3600)
# Average TSDB growth in bytes/second, assuming roughly 30 days of retained blocks
prometheus_tsdb_storage_blocks_bytes / (30 * 24 * 3600)
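predict_linear fits a least-squares line to the samples in the range and extrapolates it forward, which is what makes the "will the disk fill in 4 hours?" check work. The sketch below reproduces that idea in plain Python with made-up samples:

```python
import numpy as np

# (seconds_since_start, free_bytes) samples over the last hour, oldest first
samples = np.array([
    (0,    80e9),
    (900,  72e9),
    (1800, 65e9),
    (2700, 57e9),
    (3600, 50e9),
])

t, free = samples[:, 0], samples[:, 1]
slope, intercept = np.polyfit(t, free, 1)  # least-squares line fit

horizon = 4 * 3600  # extrapolate 4 hours past the newest sample
predicted_free = slope * (t[-1] + horizon) + intercept

print(f"slope: {slope / 1e6:.1f} MB/s, free in 4h: {predicted_free / 1e9:.1f} GB")
if predicted_free < 0:
    print("Filesystem is projected to fill within 4 hours -> alert")
```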
Advanced Patterns
Federation for Multi-Cluster Monitoring
# Global Prometheus federating from cluster Prometheus instances
scrape_configs:
- job_name: 'federate'
scrape_interval: 15s
honor_labels: true
metrics_path: '/federate'
params:
'match[]':
- '{job="prometheus"}'
- '{__name__=~"job:.*"}' # All recording rules
static_configs:
- targets:
- 'prometheus-us-west:9090'
- 'prometheus-us-east:9090'
- 'prometheus-eu-central:9090'
Cost Monitoring Pattern
# Track cloud costs with custom metrics
groups:
- name: cost_tracking
rules:
- record: cloud:cost:hourly_rate
expr: |
(
sum(kube_pod_container_resource_requests{resource="cpu"}) * 0.03 # CPU cost/hour
+
sum(kube_pod_container_resource_requests{resource="memory"} / 1024 / 1024 / 1024) * 0.005 # Memory cost/hour
)
- record: cloud:cost:monthly_estimate
expr: cloud:cost:hourly_rate * 730 # Hours in average month
Custom SLO Implementation
# SLO: 99.9% availability for API
groups:
- name: api_slo
interval: 30s
rules:
# Success rate SLI
- record: api:sli:success_rate
expr: |
sum(rate(http_requests_total{job="api",status=~"2.."}[5m]))
/
sum(rate(http_requests_total{job="api"}[5m]))
# Error budget remaining (fraction, based on the current 5m success rate)
- record: api:error_budget:remaining
expr: |
1 - (
(1 - api:sli:success_rate)
/
(1 - 0.999)
)
# Latency SLI: 1 when p99 < 500ms, 0 otherwise (bool keeps the series when the target is missed)
- record: api:sli:latency_success_rate
expr: |
(
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket{job="api"}[5m])) by (le)
) < bool 0.5
)
Examples Summary
This skill includes 20+ comprehensive examples covering:
- Prometheus configuration (basic, Kubernetes SD, storage)
- PromQL queries (instant vectors, range vectors, aggregations)
- Mathematical operations and advanced patterns
- Alert rule definitions (error rate, latency, resource usage)
- Alertmanager configuration (routing, receivers, inhibition)
- Multi-window multi-burn-rate SLO alerts
- Grafana dashboard JSON (full dashboard, RED method, USE method)
- Custom exporters (Python, Go)
- Third-party exporters (PostgreSQL, Blackbox)
- Recording rules for performance
- Federation for multi-cluster monitoring
- Cost monitoring and SLO implementation
- High availability patterns
- Capacity planning queries
Skill Version: 1.0.0
Last Updated: October 2025
Skill Category: Observability, Monitoring, SRE, DevOps
Compatible With: Prometheus, Grafana, Alertmanager, OpenTelemetry, Kubernetes