| name | observability-monitoring |
| description | Comprehensive observability and monitoring skill covering Prometheus, Grafana, metrics collection, alerting, exporters, PromQL, and production monitoring patterns for distributed systems and cloud-native applications |
Observability & Monitoring
A comprehensive skill for implementing production-grade observability and monitoring using Prometheus, Grafana, and the wider cloud-native monitoring ecosystem. This skill covers metrics collection, time-series analysis, alerting, visualization, and operational excellence patterns.
When to Use This Skill
Use this skill when:
- Setting up monitoring for production systems and applications
- Implementing metrics collection and observability for microservices
- Creating dashboards and visualizations for system health monitoring
- Defining alerting rules and incident response automation
- Analyzing system performance and capacity using time-series data
- Implementing SLIs, SLOs, and SLAs for service reliability
- Debugging production issues using metrics and traces
- Building custom exporters for application-specific metrics
- Setting up federation for multi-cluster monitoring
- Migrating from legacy monitoring to cloud-native solutions
- Implementing cost monitoring and optimization tracking
- Creating real-time operational dashboards for DevOps teams
Core Concepts
The Four Pillars of Observability
Modern observability is built on four fundamental pillars:
Metrics: Numerical measurements of system behavior over time
- Counter: Monotonically increasing values (requests served, errors)
- Gauge: Point-in-time values that go up and down (memory usage, temperature)
- Histogram: Distribution of values (request duration buckets)
- Summary: Similar to histogram, but quantiles are calculated on the client side
Logs: Discrete events with contextual information
- Structured logging (JSON, key-value pairs)
- Centralized log aggregation (ELK, Loki)
- Correlation with metrics and traces
Traces: Request flow through distributed systems
- Span: Single unit of work with start/end time
- Trace: Collection of spans representing end-to-end request
- OpenTelemetry for distributed tracing
Events: Significant occurrences in system lifecycle
- Deployments, configuration changes
- Scaling events, incidents
- Business events and user actions
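To make the four metric types under Metrics concrete, here is a minimal sketch using the Python prometheus_client library (the metric names, labels, and port are illustrative, not part of any real service):

```python
from prometheus_client import Counter, Gauge, Histogram, Summary, start_http_server
import random
import time

# Counter: monotonically increasing; resets only when the process restarts
REQUESTS = Counter("demo_requests_total", "Requests served", ["status"])

# Gauge: point-in-time value that can go up and down
IN_FLIGHT = Gauge("demo_in_flight_requests", "Requests currently being handled")

# Histogram: server-side buckets; quantiles are computed later in PromQL
LATENCY = Histogram("demo_request_duration_seconds", "Request duration",
                    buckets=[0.05, 0.1, 0.25, 0.5, 1.0])

# Summary: client-side aggregation (the Python client exposes only _sum and _count)
PAYLOAD = Summary("demo_payload_bytes", "Payload size in bytes")

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics on :8000
    while True:
        with IN_FLIGHT.track_inprogress(), LATENCY.time():
            time.sleep(random.uniform(0.01, 0.3))  # simulated work
        REQUESTS.labels(status="200").inc()
        PAYLOAD.observe(random.randint(200, 5000))
```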
Prometheus Architecture
Prometheus is a pull-based monitoring system with key components:
Time-Series Database (TSDB)
- Stores metrics as time-series data
- Efficient compression and retention policies
- Local storage with optional remote storage
Scrape Targets
- Service discovery (Kubernetes, Consul, EC2, etc.)
- Static configuration
- Relabeling for flexible target selection
PromQL Query Engine
- Powerful query language for metrics analysis
- Aggregation, filtering, and mathematical operations
- Range vectors and instant vectors
Alertmanager
- Alert rule evaluation
- Grouping, silencing, and routing
- Integration with PagerDuty, Slack, email, webhooks
Exporters
- Bridge between Prometheus and systems that don't expose Prometheus metrics natively
- Node exporter, cAdvisor, custom exporters
- Third-party exporters for databases, services
Metric Labels and Cardinality
Labels are key-value pairs attached to metrics:
http_requests_total{method="GET", endpoint="/api/users", status="200"}
Label Best Practices:
- Use labels for dimensions you query/aggregate by
- Avoid high-cardinality labels (user IDs, timestamps)
- Keep label names consistent across metrics
- Use relabeling to normalize external labels
Cardinality Considerations:
- Each unique label combination = new time-series
- High cardinality = increased memory and storage
- Monitor cardinality with prometheus_tsdb_symbol_table_size_bytes
- Use recording rules to pre-aggregate high-cardinality metrics
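To watch cardinality in practice, the TSDB status endpoint of the Prometheus HTTP API reports which metrics contribute the most series. A minimal sketch, assuming Prometheus is reachable at http://localhost:9090 and the requests package is available:

```python
import requests

PROM_URL = "http://localhost:9090"  # assumed Prometheus address

# /api/v1/status/tsdb returns head-block statistics, including the metric
# names and label pairs that account for the most series
resp = requests.get(f"{PROM_URL}/api/v1/status/tsdb", timeout=5)
resp.raise_for_status()
stats = resp.json()["data"]

print("Series in head block:", stats["headStats"]["numSeries"])
print("Top metrics by series count:")
for entry in stats["seriesCountByMetricName"][:10]:
    print(f"  {entry['name']}: {entry['value']}")
```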
Recording Rules
Pre-compute frequently-used or expensive queries:
groups:
- name: api_performance
interval: 30s
rules:
- record: api:http_request_duration_seconds:p99
expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
- record: api:http_requests:rate5m
expr: rate(http_requests_total[5m])
Benefits:
- Faster dashboard loading
- Reduced query load on Prometheus
- Consistent metric naming conventions
- Enable complex aggregations
Service Level Objectives (SLOs)
Define and track reliability targets:
SLI (Service Level Indicator): Metric measuring service quality
- Availability: % of successful requests
- Latency: % of requests under threshold
- Throughput: Requests per second
SLO (Service Level Objective): Target for SLI
- 99.9% availability (43.8 minutes downtime/month)
- 95% of requests < 200ms
- 1000 RPS sustained
SLA (Service Level Agreement): Contract with consequences
- External commitments to customers
- Financial penalties for SLO violations
Error Budget: Acceptable failure rate
- Error budget = 100% - SLO
- 99.9% SLO = 0.1% error budget
- Use budget for innovation vs. reliability tradeoff
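The error-budget arithmetic is simple enough to automate. A small sketch (the SLO values are examples) that converts an SLO target into allowed downtime per 30-day window:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for a given SLO over the window."""
    error_budget = 1.0 - slo  # e.g. 99.9% SLO -> 0.1% budget
    return window_days * 24 * 60 * error_budget

for slo in (0.99, 0.999, 0.9999):
    print(f"{slo:.2%} SLO -> {error_budget_minutes(slo):.1f} minutes of error budget per 30 days")
# 99.00% -> 432.0 minutes, 99.90% -> 43.2 minutes, 99.99% -> 4.3 minutes
```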
Prometheus Setup and Configuration
Basic Prometheus Configuration
# prometheus.yml
global:
scrape_interval: 15s # Default scrape interval
evaluation_interval: 15s # Alert rule evaluation interval
external_labels:
cluster: 'production'
region: 'us-west-2'
# Alertmanager configuration
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager:9093
# Load rules
rule_files:
- 'rules/*.yml'
- 'alerts/*.yml'
# Scrape configurations
scrape_configs:
# Prometheus self-monitoring
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
# Node exporter for system metrics
- job_name: 'node'
static_configs:
- targets:
- 'node1:9100'
- 'node2:9100'
- 'node3:9100'
relabel_configs:
- source_labels: [__address__]
target_label: instance
regex: '([^:]+):.*'
replacement: '${1}'
# Application metrics
- job_name: 'api'
static_configs:
- targets: ['api-1:8080', 'api-2:8080', 'api-3:8080']
labels:
env: 'production'
tier: 'backend'
Kubernetes Service Discovery
scrape_configs:
# Kubernetes API server
- job_name: 'kubernetes-apiservers'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: default;kubernetes;https
# Kubernetes pods with prometheus.io annotations
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
# Only scrape pods with prometheus.io/scrape: "true" annotation
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
# Use the port from prometheus.io/port annotation
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
# Use the path from prometheus.io/path annotation
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
# Add namespace label
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
# Add pod name label
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: kubernetes_pod_name
# Kubernetes services
- job_name: 'kubernetes-services'
kubernetes_sd_configs:
- role: service
metrics_path: /probe
params:
module: [http_2xx]
relabel_configs:
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe]
action: keep
regex: true
- source_labels: [__address__]
target_label: __param_target
- target_label: __address__
replacement: blackbox-exporter:9115
- source_labels: [__param_target]
target_label: instance
Storage and Retention
# Storage path and retention are set via command-line flags, not in prometheus.yml:
#   --storage.tsdb.path=/prometheus/data
#   --storage.tsdb.retention.time=15d
#   --storage.tsdb.retention.size=50GB
# Remote write for long-term storage
remote_write:
- url: "https://prometheus-remote-storage.example.com/api/v1/write"
basic_auth:
username: prometheus
password_file: /etc/prometheus/remote_storage_password
queue_config:
capacity: 10000
max_shards: 50
max_samples_per_send: 5000
write_relabel_configs:
# Drop high-cardinality metrics
- source_labels: [__name__]
regex: 'container_network_.*'
action: drop
# Remote read for querying historical data
remote_read:
- url: "https://prometheus-remote-storage.example.com/api/v1/read"
basic_auth:
username: prometheus
password_file: /etc/prometheus/remote_storage_password
read_recent: true
PromQL: The Prometheus Query Language
Instant Vectors and Selectors
# Basic metric selection
http_requests_total
# Filter by label
http_requests_total{job="api", status="200"}
# Regex matching
http_requests_total{status=~"2..|3.."}
# Negative matching
http_requests_total{status!="500"}
# Multiple label matchers
http_requests_total{job="api", method="GET", status=~"2.."}
Range Vectors and Aggregations
# 5-minute range vector
http_requests_total[5m]
# Rate of increase per second
rate(http_requests_total[5m])
# Increase over time window
increase(http_requests_total[1h])
# Average over time
avg_over_time(cpu_usage[5m])
# Max/Min over time
max_over_time(response_time_seconds[10m])
min_over_time(response_time_seconds[10m])
# Standard deviation
stddev_over_time(response_time_seconds[5m])
Aggregation Operators
# Sum across all instances
sum(rate(http_requests_total[5m]))
# Sum grouped by job
sum by (job) (rate(http_requests_total[5m]))
# Average grouped by multiple labels
avg by (job, instance) (cpu_usage)
# Count number of series
count(up == 1)
# Topk and bottomk
topk(5, rate(http_requests_total[5m]))
bottomk(3, node_memory_MemAvailable_bytes)
# Quantile across instances
quantile(0.95, http_request_duration_seconds)
Mathematical Operations
# Arithmetic operations
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
# Comparison operators
http_request_duration_seconds > 0.5
# Logical operators
up == 1 and rate(http_requests_total[5m]) > 100
# Vector matching
rate(http_requests_total[5m]) / on(instance) group_left rate(http_responses_total[5m])
Advanced PromQL Patterns
# Request success rate
sum(rate(http_requests_total{status=~"2.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100
# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100
# Latency percentiles (histogram)
histogram_quantile(0.99,
sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)
# Predict linear growth
predict_linear(node_filesystem_free_bytes[1h], 4 * 3600)
# Detect anomalies with standard deviation
abs(cpu_usage - avg_over_time(cpu_usage[1h]))
>
3 * stddev_over_time(cpu_usage[1h])
# CPU utilization per instance (USE method)
sum(rate(node_cpu_seconds_total{mode!="idle"}[5m])) by (instance)
/
count(node_cpu_seconds_total{mode="idle"}) by (instance)
# Burn rate for SLO
(
sum(rate(http_requests_total{status=~"5.."}[1h]))
/
sum(rate(http_requests_total[1h]))
)
>
(14.4 * (1 - 0.999)) # For 99.9% SLO
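Any of these expressions can also be evaluated programmatically through the Prometheus HTTP API, which is handy for reports and ad-hoc checks. A sketch using the success-rate query above, assuming Prometheus at http://localhost:9090:

```python
import requests

PROM_URL = "http://localhost:9090"  # assumed Prometheus address

# Request success rate, evaluated as an instant query
query = """
sum(rate(http_requests_total{status=~"2.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100
"""

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}, timeout=10)
resp.raise_for_status()
result = resp.json()["data"]["result"]

if result:
    _timestamp, value = result[0]["value"]
    print(f"Current success rate: {float(value):.2f}%")
else:
    print("Query returned no series (is http_requests_total being scraped?)")
```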
Alerting with Prometheus and Alertmanager
Alert Rule Definitions
# alerts/api_alerts.yml
groups:
- name: api_alerts
interval: 30s
rules:
# High error rate alert
- alert: HighErrorRate
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
/
sum(rate(http_requests_total[5m])) by (service)
> 0.05
for: 5m
labels:
severity: critical
team: backend
annotations:
summary: "High error rate on {{ $labels.service }}"
description: "Error rate is {{ $value | humanizePercentage }} on {{ $labels.service }}"
runbook_url: "https://runbooks.example.com/HighErrorRate"
# High latency alert
- alert: HighLatency
expr: |
histogram_quantile(0.99,
sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
) > 1
for: 10m
labels:
severity: warning
team: backend
annotations:
summary: "High latency on {{ $labels.service }}"
description: "P99 latency is {{ $value }}s on {{ $labels.service }}"
# Service down alert
- alert: ServiceDown
expr: up{job="api"} == 0
for: 2m
labels:
severity: critical
team: sre
annotations:
summary: "Service {{ $labels.instance }} is down"
description: "{{ $labels.job }} on {{ $labels.instance }} has been down for more than 2 minutes"
# Disk space alert
- alert: DiskSpaceLow
expr: |
(node_filesystem_avail_bytes{mountpoint="/"}
/
node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10
for: 5m
labels:
severity: warning
team: sre
annotations:
summary: "Low disk space on {{ $labels.instance }}"
description: "Disk space is {{ $value | humanize }}% on {{ $labels.instance }}"
# Memory pressure alert
- alert: HighMemoryUsage
expr: |
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
for: 10m
labels:
severity: warning
team: sre
annotations:
summary: "High memory usage on {{ $labels.instance }}"
description: "Memory usage is {{ $value | humanize }}% on {{ $labels.instance }}"
# CPU saturation alert
- alert: HighCPUUsage
expr: |
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 15m
labels:
severity: warning
team: sre
annotations:
summary: "High CPU usage on {{ $labels.instance }}"
description: "CPU usage is {{ $value | humanize }}% on {{ $labels.instance }}"
Alertmanager Configuration
# alertmanager.yml
global:
resolve_timeout: 5m
slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'
# Templates for notifications
templates:
- '/etc/alertmanager/templates/*.tmpl'
# Route tree for alert distribution
route:
group_by: ['alertname', 'cluster', 'service']
group_wait: 10s
group_interval: 10s
repeat_interval: 12h
receiver: 'team-default'
routes:
# Critical alerts go to PagerDuty
- match:
severity: critical
receiver: 'pagerduty-critical'
continue: true
# Critical alerts also go to Slack
- match:
severity: critical
receiver: 'slack-critical'
group_wait: 0s
# Warning alerts to Slack only
- match:
severity: warning
receiver: 'slack-warnings'
# Team-specific routing
- match:
team: backend
receiver: 'team-backend'
- match:
team: frontend
receiver: 'team-frontend'
# Database alerts to DBA team
- match_re:
service: 'postgres|mysql|mongodb'
receiver: 'team-dba'
# Alert receivers/integrations
receivers:
- name: 'team-default'
slack_configs:
- channel: '#alerts'
title: 'Alert: {{ .GroupLabels.alertname }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
send_resolved: true
- name: 'pagerduty-critical'
pagerduty_configs:
- routing_key: 'YOUR_PAGERDUTY_ROUTING_KEY'
description: '{{ .GroupLabels.alertname }}: {{ .GroupLabels.service }}'
severity: '{{ .CommonLabels.severity }}'
- name: 'slack-critical'
slack_configs:
- channel: '#incidents'
title: 'CRITICAL: {{ .GroupLabels.alertname }}'
text: '{{ range .Alerts }}{{ .Annotations.summary }}\n{{ .Annotations.description }}{{ end }}'
color: 'danger'
send_resolved: true
- name: 'slack-warnings'
slack_configs:
- channel: '#monitoring'
title: 'Warning: {{ .GroupLabels.alertname }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
color: 'warning'
send_resolved: true
- name: 'team-backend'
slack_configs:
- channel: '#team-backend'
send_resolved: true
email_configs:
- to: 'backend-team@example.com'
from: 'alertmanager@example.com'
smarthost: 'smtp.gmail.com:587'
auth_username: 'alertmanager@example.com'
auth_password_file: '/etc/alertmanager/email_password'
- name: 'team-frontend'
slack_configs:
- channel: '#team-frontend'
send_resolved: true
- name: 'team-dba'
slack_configs:
- channel: '#team-dba'
send_resolved: true
pagerduty_configs:
- routing_key: 'DBA_PAGERDUTY_KEY'
# Inhibition rules (suppress alerts)
inhibit_rules:
# Inhibit warnings if critical alert is firing
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
equal: ['alertname', 'instance']
# Don't alert on instance down if cluster is down
- source_match:
alertname: 'ClusterDown'
target_match_re:
alertname: 'InstanceDown|ServiceDown'
equal: ['cluster']
Multi-Window Multi-Burn-Rate Alerts for SLOs
# SLO-based alerting using burn rate
groups:
- name: slo_alerts
interval: 30s
rules:
# Fast burn (1h window, 5m burn)
- alert: ErrorBudgetBurnFast
expr: |
(
sum(rate(http_requests_total{status=~"5.."}[1h]))
/
sum(rate(http_requests_total[1h]))
) > (14.4 * (1 - 0.999))
and
(
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
) > (14.4 * (1 - 0.999))
for: 2m
labels:
severity: critical
slo: "99.9%"
annotations:
summary: "Fast error budget burn detected"
description: "Error rate is burning through 99.9% SLO budget 14.4x faster than normal"
# Slow burn (6h window, 30m burn)
- alert: ErrorBudgetBurnSlow
expr: |
(
sum(rate(http_requests_total{status=~"5.."}[6h]))
/
sum(rate(http_requests_total[6h]))
) > (6 * (1 - 0.999))
and
(
sum(rate(http_requests_total{status=~"5.."}[30m]))
/
sum(rate(http_requests_total[30m]))
) > (6 * (1 - 0.999))
for: 15m
labels:
severity: warning
slo: "99.9%"
annotations:
summary: "Slow error budget burn detected"
description: "Error rate is burning through 99.9% SLO budget 6x faster than normal"
Grafana Dashboards and Visualization
Dashboard JSON Structure
{
"dashboard": {
"title": "API Performance Dashboard",
"tags": ["api", "performance", "production"],
"timezone": "browser",
"editable": true,
"graphTooltip": 1,
"time": {
"from": "now-6h",
"to": "now"
},
"timepicker": {
"refresh_intervals": ["5s", "10s", "30s", "1m", "5m", "15m"],
"time_options": ["5m", "15m", "1h", "6h", "12h", "24h", "7d"]
},
"templating": {
"list": [
{
"name": "cluster",
"type": "query",
"datasource": "Prometheus",
"query": "label_values(up, cluster)",
"refresh": 1,
"multi": false,
"includeAll": false
},
{
"name": "service",
"type": "query",
"datasource": "Prometheus",
"query": "label_values(up{cluster=\"$cluster\"}, service)",
"refresh": 1,
"multi": true,
"includeAll": true
},
{
"name": "interval",
"type": "interval",
"query": "1m,5m,10m,30m,1h",
"auto": true,
"auto_count": 30,
"auto_min": "10s"
}
]
},
"panels": [
{
"id": 1,
"title": "Request Rate",
"type": "graph",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
"targets": [
{
"expr": "sum(rate(http_requests_total{service=~\"$service\"}[$interval])) by (service)",
"legendFormat": "{{ service }}",
"refId": "A"
}
],
"yaxes": [
{"format": "reqps", "label": "Requests/sec"},
{"format": "short"}
],
"legend": {
"show": true,
"values": true,
"current": true,
"avg": true,
"max": true
}
},
{
"id": 2,
"title": "Error Rate",
"type": "graph",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
"targets": [
{
"expr": "sum(rate(http_requests_total{service=~\"$service\",status=~\"5..\"}[$interval])) by (service) / sum(rate(http_requests_total{service=~\"$service\"}[$interval])) by (service) * 100",
"legendFormat": "{{ service }} error %",
"refId": "A"
}
],
"yaxes": [
{"format": "percent", "label": "Error Rate"},
{"format": "short"}
],
"alert": {
"conditions": [
{
"evaluator": {"params": [5], "type": "gt"},
"operator": {"type": "and"},
"query": {"params": ["A", "5m", "now"]},
"reducer": {"params": [], "type": "avg"},
"type": "query"
}
],
"executionErrorState": "alerting",
"frequency": "1m",
"handler": 1,
"name": "High Error Rate",
"noDataState": "no_data",
"notifications": []
}
},
{
"id": 3,
"title": "Latency Percentiles",
"type": "graph",
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 8},
"targets": [
{
"expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service=~\"$service\"}[$interval])) by (service, le))",
"legendFormat": "{{ service }} p99",
"refId": "A"
},
{
"expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service=~\"$service\"}[$interval])) by (service, le))",
"legendFormat": "{{ service }} p95",
"refId": "B"
},
{
"expr": "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{service=~\"$service\"}[$interval])) by (service, le))",
"legendFormat": "{{ service }} p50",
"refId": "C"
}
],
"yaxes": [
{"format": "s", "label": "Duration"},
{"format": "short"}
]
}
]
}
}
RED Method Dashboard
The RED method focuses on Request rate, Error rate, and Duration:
{
"panels": [
{
"title": "Request Rate (per service)",
"targets": [
{
"expr": "sum(rate(http_requests_total[$__rate_interval])) by (service)"
}
]
},
{
"title": "Error Rate % (per service)",
"targets": [
{
"expr": "sum(rate(http_requests_total{status=~\"5..\"}[$__rate_interval])) by (service) / sum(rate(http_requests_total[$__rate_interval])) by (service) * 100"
}
]
},
{
"title": "Duration p99 (per service)",
"targets": [
{
"expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[$__rate_interval])) by (service, le))"
}
]
}
]
}
USE Method Dashboard
The USE method monitors Utilization, Saturation, and Errors:
{
"panels": [
{
"title": "CPU Utilization %",
"targets": [
{
"expr": "100 - (avg by (instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[$__rate_interval])) * 100)"
}
]
},
{
"title": "CPU Saturation (Load Average)",
"targets": [
{
"expr": "node_load1"
}
]
},
{
"title": "Memory Utilization %",
"targets": [
{
"expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100"
}
]
},
{
"title": "Disk I/O Utilization %",
"targets": [
{
"expr": "rate(node_disk_io_time_seconds_total[$__rate_interval]) * 100"
}
]
},
{
"title": "Network Errors",
"targets": [
{
"expr": "rate(node_network_receive_errs_total[$__rate_interval]) + rate(node_network_transmit_errs_total[$__rate_interval])"
}
]
}
]
}
Exporters and Metric Collection
Node Exporter for System Metrics
# Install node_exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xvfz node_exporter-1.6.1.linux-amd64.tar.gz
cd node_exporter-1.6.1.linux-amd64
./node_exporter --web.listen-address=":9100" \
--collector.filesystem.mount-points-exclude="^/(dev|proc|sys|var/lib/docker/.+)($|/)" \
--collector.netclass.ignored-devices="^(veth.*|br.*|docker.*|lo)$"
Key Metrics from Node Exporter:
- node_cpu_seconds_total: CPU usage by mode
- node_memory_MemTotal_bytes, node_memory_MemAvailable_bytes: Memory
- node_disk_io_time_seconds_total: Disk I/O
- node_network_receive_bytes_total, node_network_transmit_bytes_total: Network
- node_filesystem_size_bytes, node_filesystem_avail_bytes: Disk space
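If you want to verify what an exporter exposes without going through Prometheus, the exposition format can be scraped and parsed directly. A sketch using the prometheus_client parser against a node_exporter assumed to listen on localhost:9100:

```python
import requests
from prometheus_client.parser import text_string_to_metric_families

raw = requests.get("http://localhost:9100/metrics", timeout=5).text  # assumed address

wanted = {"node_memory_MemTotal_bytes", "node_memory_MemAvailable_bytes"}
values = {}
for family in text_string_to_metric_families(raw):
    if family.name in wanted:
        values[family.name] = family.samples[0].value  # these gauges have a single sample

used_pct = 100 * (1 - values["node_memory_MemAvailable_bytes"]
                  / values["node_memory_MemTotal_bytes"])
print(f"Memory utilization: {used_pct:.1f}%")
```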
Custom Application Exporter (Python)
# app_exporter.py
from prometheus_client import start_http_server, Counter, Gauge, Histogram, Summary
import time
import random
# Define metrics
REQUEST_COUNT = Counter(
'app_requests_total',
'Total app requests',
['method', 'endpoint', 'status']
)
REQUEST_DURATION = Histogram(
'app_request_duration_seconds',
'Request duration in seconds',
['method', 'endpoint'],
buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 2.5, 5.0, 10.0]
)
ACTIVE_USERS = Gauge(
'app_active_users',
'Number of active users'
)
QUEUE_SIZE = Gauge(
'app_queue_size',
'Current queue size',
['queue_name']
)
DATABASE_CONNECTIONS = Gauge(
'app_database_connections',
'Number of database connections',
['pool', 'state']
)
CACHE_HITS = Counter(
'app_cache_hits_total',
'Total cache hits',
['cache_name']
)
CACHE_MISSES = Counter(
'app_cache_misses_total',
'Total cache misses',
['cache_name']
)
def simulate_metrics():
"""Simulate application metrics"""
while True:
# Simulate requests
method = random.choice(['GET', 'POST', 'PUT', 'DELETE'])
endpoint = random.choice(['/api/users', '/api/products', '/api/orders'])
status = random.choice(['200', '200', '200', '400', '500'])
REQUEST_COUNT.labels(method=method, endpoint=endpoint, status=status).inc()
# Simulate request duration
duration = random.uniform(0.01, 2.0)
REQUEST_DURATION.labels(method=method, endpoint=endpoint).observe(duration)
# Update gauges
ACTIVE_USERS.set(random.randint(100, 1000))
QUEUE_SIZE.labels(queue_name='jobs').set(random.randint(0, 50))
QUEUE_SIZE.labels(queue_name='emails').set(random.randint(0, 20))
# Database connection pool
DATABASE_CONNECTIONS.labels(pool='main', state='active').set(random.randint(5, 20))
DATABASE_CONNECTIONS.labels(pool='main', state='idle').set(random.randint(10, 30))
# Cache metrics
if random.random() > 0.3:
CACHE_HITS.labels(cache_name='redis').inc()
else:
CACHE_MISSES.labels(cache_name='redis').inc()
time.sleep(1)
if __name__ == '__main__':
# Start metrics server on port 8000
start_http_server(8000)
print("Metrics server started on port 8000")
simulate_metrics()
Custom Exporter (Go)
package main
import (
"log"
"math/rand"
"net/http"
"time"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promhttp"
)
var (
requestsTotal = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "app_requests_total",
Help: "Total number of requests",
},
[]string{"method", "endpoint", "status"},
)
requestDuration = prometheus.NewHistogramVec(
prometheus.HistogramOpts{
Name: "app_request_duration_seconds",
Help: "Request duration in seconds",
Buckets: prometheus.ExponentialBuckets(0.01, 2, 10),
},
[]string{"method", "endpoint"},
)
activeUsers = prometheus.NewGauge(
prometheus.GaugeOpts{
Name: "app_active_users",
Help: "Number of active users",
},
)
databaseConnections = prometheus.NewGaugeVec(
prometheus.GaugeOpts{
Name: "app_database_connections",
Help: "Number of database connections",
},
[]string{"pool", "state"},
)
)
func init() {
prometheus.MustRegister(requestsTotal)
prometheus.MustRegister(requestDuration)
prometheus.MustRegister(activeUsers)
prometheus.MustRegister(databaseConnections)
}
func simulateMetrics() {
ticker := time.NewTicker(1 * time.Second)
defer ticker.Stop()
for range ticker.C {
// Simulate requests
methods := []string{"GET", "POST", "PUT", "DELETE"}
endpoints := []string{"/api/users", "/api/products", "/api/orders"}
statuses := []string{"200", "200", "200", "400", "500"}
method := methods[rand.Intn(len(methods))]
endpoint := endpoints[rand.Intn(len(endpoints))]
status := statuses[rand.Intn(len(statuses))]
requestsTotal.WithLabelValues(method, endpoint, status).Inc()
requestDuration.WithLabelValues(method, endpoint).Observe(rand.Float64() * 2)
// Update gauges
activeUsers.Set(float64(rand.Intn(900) + 100))
databaseConnections.WithLabelValues("main", "active").Set(float64(rand.Intn(15) + 5))
databaseConnections.WithLabelValues("main", "idle").Set(float64(rand.Intn(20) + 10))
}
}
func main() {
go simulateMetrics()
http.Handle("/metrics", promhttp.Handler())
log.Println("Starting metrics server on :8000")
log.Fatal(http.ListenAndServe(":8000", nil))
}
PostgreSQL Exporter
# docker-compose.yml for postgres_exporter
version: '3.8'
services:
postgres-exporter:
image: prometheuscommunity/postgres-exporter
environment:
DATA_SOURCE_NAME: "postgresql://user:password@postgres:5432/dbname?sslmode=disable"
ports:
- "9187:9187"
command:
- '--collector.stat_statements'
- '--collector.stat_database'
- '--collector.replication'
Key PostgreSQL Metrics:
- pg_up: Database reachability
- pg_stat_database_tup_returned: Rows read
- pg_stat_database_tup_inserted: Rows inserted
- pg_stat_database_deadlocks: Deadlock count
- pg_stat_replication_lag: Replication lag in seconds
- pg_locks_count: Active locks
Blackbox Exporter for Probing
# blackbox.yml
modules:
http_2xx:
prober: http
timeout: 5s
http:
valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
valid_status_codes: [200]
method: GET
preferred_ip_protocol: "ip4"
http_post_json:
prober: http
http:
method: POST
headers:
Content-Type: application/json
body: '{"key":"value"}'
valid_status_codes: [200, 201]
tcp_connect:
prober: tcp
timeout: 5s
icmp:
prober: icmp
timeout: 5s
icmp:
preferred_ip_protocol: "ip4"
# Prometheus config for blackbox exporter
scrape_configs:
- job_name: 'blackbox-http'
metrics_path: /probe
params:
module: [http_2xx]
static_configs:
- targets:
- https://example.com
- https://api.example.com/health
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: blackbox-exporter:9115
Best Practices
Metric Naming Conventions
Follow Prometheus naming best practices:
# Format: <namespace>_<subsystem>_<metric>_<unit>
# Good examples
http_requests_total # Counter
http_request_duration_seconds # Histogram
database_connections_active # Gauge
cache_hits_total # Counter
memory_usage_bytes # Gauge
# Include unit suffixes
_seconds, _bytes, _total, _ratio (prefer 0-1 ratios over percentages)
# Avoid
RequestCount # Use snake_case
http_requests # Missing _total for counter
request_time # Missing unit (should be _seconds)
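A small lint helper can catch most of these naming issues in code review or CI. The regex is the metric-name pattern from the Prometheus data model; the suffix check is a heuristic, not an official rule:

```python
import re

# Metric names must match this pattern per the Prometheus data model
METRIC_NAME_RE = re.compile(r"^[a-zA-Z_:][a-zA-Z0-9_:]*$")
UNIT_SUFFIXES = ("_total", "_seconds", "_bytes", "_ratio", "_info")

def lint_metric_name(name: str) -> list[str]:
    problems = []
    if not METRIC_NAME_RE.match(name):
        problems.append("contains characters outside [a-zA-Z0-9_:]")
    if name != name.lower():
        problems.append("use snake_case, not CamelCase")
    if not name.endswith(UNIT_SUFFIXES):
        problems.append("consider a unit suffix (_seconds, _bytes, _total, ...)")
    return problems

for candidate in ("RequestCount", "http_requests_total", "request_time"):
    print(candidate, "->", lint_metric_name(candidate) or "ok")
```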
Label Guidelines
# Good: Low cardinality labels
http_requests_total{method="GET", endpoint="/api/users", status="200"}
# Bad: High cardinality labels (avoid)
http_requests_total{user_id="12345", session_id="abc-def-ghi"}
# Good: Use bounded label values
http_requests_total{status_class="2xx"}
# Bad: Unbounded label values
http_requests_total{response_size="1234567"}
Recording Rule Patterns
groups:
- name: performance_rules
interval: 30s
rules:
# Pre-aggregate expensive queries
- record: job:http_requests:rate5m
expr: sum(rate(http_requests_total[5m])) by (job)
# Namespace aggregations
- record: namespace:http_requests:rate5m
expr: sum(rate(http_requests_total[5m])) by (namespace)
# SLI calculations
- record: job:http_requests_success:rate5m
expr: sum(rate(http_requests_total{status=~"2.."}[5m])) by (job)
- record: job:http_requests_error_rate:ratio
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
/
sum(rate(http_requests_total[5m])) by (job)
Alert Design Principles
- Alert on symptoms, not causes: Alert on user-facing issues
- Make alerts actionable: Include runbook links
- Use appropriate severity levels: Critical, warning, info
- Set proper thresholds: Based on historical data
- Include context in annotations: Help on-call engineers
- Group related alerts: Reduce alert fatigue
- Use inhibition rules: Suppress redundant alerts
- Test alert rules: Verify they fire when expected
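Beyond promtool's rule unit tests, one practical way to exercise routing, grouping, and receivers end to end is to push a synthetic alert into Alertmanager's v2 API and confirm it reaches the expected channel. A sketch, assuming Alertmanager listens on localhost:9093 and the label values match the routing tree shown earlier:

```python
import datetime
import requests

ALERTMANAGER = "http://localhost:9093"  # assumed address

now = datetime.datetime.now(datetime.timezone.utc)
test_alert = [{
    "labels": {
        "alertname": "SyntheticRoutingTest",
        "severity": "warning",   # should match the slack-warnings route
        "team": "backend",
        "service": "api",
    },
    "annotations": {
        "summary": "Synthetic alert to verify Alertmanager routing",
        "description": "Safe to ignore; sent by the routing test script.",
    },
    "startsAt": now.isoformat(),
    "endsAt": (now + datetime.timedelta(minutes=5)).isoformat(),
}]

resp = requests.post(f"{ALERTMANAGER}/api/v2/alerts", json=test_alert, timeout=5)
resp.raise_for_status()
print("Alert accepted; check the receiver configured for the warning route.")
```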
Dashboard Best Practices
- One dashboard per audience: SRE, developers, business
- Use consistent time ranges: Make comparisons easier
- Include SLI/SLO metrics: Show business impact
- Add annotations for deploys: Correlate changes with metrics
- Use template variables: Make dashboards reusable
- Show trends and aggregates: Not just raw metrics
- Include links to runbooks: Enable quick response
- Use appropriate visualizations: Graphs, gauges, tables
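To keep dashboards reusable and reviewable, many teams store the JSON in version control and push it through the Grafana HTTP API rather than editing in the UI. A sketch, assuming Grafana at http://localhost:3000, a service-account token, and the dashboard JSON from the earlier example saved to a file:

```python
import json
import requests

GRAFANA_URL = "http://localhost:3000"              # assumed Grafana address
API_TOKEN = "REPLACE_WITH_SERVICE_ACCOUNT_TOKEN"   # assumed token

with open("api_performance_dashboard.json") as f:  # dashboard JSON as shown above
    dashboard = json.load(f)["dashboard"]

payload = {
    "dashboard": dashboard,
    "overwrite": True,  # update in place so Git stays the source of truth
}
resp = requests.post(
    f"{GRAFANA_URL}/api/dashboards/db",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json=payload,
    timeout=10,
)
resp.raise_for_status()
print("Dashboard provisioned:", resp.json().get("url"))
```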
High Availability Setup
# Prometheus HA with Thanos
# Deploy multiple Prometheus instances with same config
# Use Thanos to deduplicate and provide global view
# prometheus-1.yml
global:
external_labels:
cluster: 'prod'
replica: '1'
# prometheus-2.yml
global:
external_labels:
cluster: 'prod'
replica: '2'
# Thanos sidecar configuration
# Uploads blocks to object storage
# Provides StoreAPI for querying
Capacity Planning Queries
# Disk space exhaustion prediction
predict_linear(node_filesystem_free_bytes[1h], 4 * 3600) < 0
# Memory growth trend
predict_linear(node_memory_MemAvailable_bytes[1h], 24 * 3600)
# Request rate growth
predict_linear(sum(rate(http_requests_total[1h]))[24h:1h], 7 * 24 * 3600)
# Average TSDB growth in bytes/second, assuming roughly 30 days of retained blocks
prometheus_tsdb_storage_blocks_bytes / (30 * 24 * 3600)
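predict_linear fits a least-squares line to the samples in the range and extrapolates it forward, which is what makes the "will the disk fill in 4 hours?" check work. The sketch below reproduces that idea in plain Python with made-up samples:

```python
import numpy as np

# (seconds_since_start, free_bytes) samples over the last hour, oldest first
samples = np.array([
    (0,    80e9),
    (900,  72e9),
    (1800, 65e9),
    (2700, 57e9),
    (3600, 50e9),
])

t, free = samples[:, 0], samples[:, 1]
slope, intercept = np.polyfit(t, free, 1)  # least-squares line fit

horizon = 4 * 3600  # extrapolate 4 hours past the newest sample
predicted_free = slope * (t[-1] + horizon) + intercept

print(f"slope: {slope / 1e6:.1f} MB/s, free in 4h: {predicted_free / 1e9:.1f} GB")
if predicted_free < 0:
    print("Filesystem is projected to fill within 4 hours -> alert")
```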
Advanced Patterns
Federation for Multi-Cluster Monitoring
# Global Prometheus federating from cluster Prometheus instances
scrape_configs:
- job_name: 'federate'
scrape_interval: 15s
honor_labels: true
metrics_path: '/federate'
params:
'match[]':
- '{job="prometheus"}'
- '{__name__=~"job:.*"}' # All recording rules
static_configs:
- targets:
- 'prometheus-us-west:9090'
- 'prometheus-us-east:9090'
- 'prometheus-eu-central:9090'
Cost Monitoring Pattern
# Track cloud costs with custom metrics
groups:
- name: cost_tracking
rules:
- record: cloud:cost:hourly_rate
expr: |
(
sum(kube_pod_container_resource_requests{resource="cpu"}) * 0.03 # CPU cost/hour
+
sum(kube_pod_container_resource_requests{resource="memory"} / 1024 / 1024 / 1024) * 0.005 # Memory cost/hour
)
- record: cloud:cost:monthly_estimate
expr: cloud:cost:hourly_rate * 730 # Hours in average month
Custom SLO Implementation
# SLO: 99.9% availability for API
groups:
- name: api_slo
interval: 30s
rules:
# Success rate SLI
- record: api:sli:success_rate
expr: |
sum(rate(http_requests_total{job="api",status=~"2.."}[5m]))
/
sum(rate(http_requests_total{job="api"}[5m]))
# Error budget remaining (fraction, based on the current 5m success rate)
- record: api:error_budget:remaining
expr: |
1 - (
(1 - api:sli:success_rate)
/
(1 - 0.999)
)
# Latency SLI: 1 when p99 < 500ms, 0 otherwise (bool keeps the series when the target is missed)
- record: api:sli:latency_success_rate
expr: |
(
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket{job="api"}[5m])) by (le)
) < bool 0.5
)
Examples Summary
This skill includes 20+ comprehensive examples covering:
- Prometheus configuration (basic, Kubernetes SD, storage)
- PromQL queries (instant vectors, range vectors, aggregations)
- Mathematical operations and advanced patterns
- Alert rule definitions (error rate, latency, resource usage)
- Alertmanager configuration (routing, receivers, inhibition)
- Multi-window multi-burn-rate SLO alerts
- Grafana dashboard JSON (full dashboard, RED method, USE method)
- Custom exporters (Python, Go)
- Third-party exporters (PostgreSQL, Blackbox)
- Recording rules for performance
- Federation for multi-cluster monitoring
- Cost monitoring and SLO implementation
- High availability patterns
- Capacity planning queries
Skill Version: 1.0.0
Last Updated: October 2025
Skill Category: Observability, Monitoring, SRE, DevOps
Compatible With: Prometheus, Grafana, Alertmanager, OpenTelemetry, Kubernetes