---
name: check-metrics
description: Query Prometheus metrics, check resource usage, and analyze platform performance in the Kagenti platform
---
# Check Metrics Skill

This skill helps you query Prometheus metrics and analyze platform performance.
## When to Use
- User asks about resource usage (CPU, memory, disk)
- Investigating performance issues
- Checking service health metrics
- After deployments to verify metrics collection
- Analyzing platform capacity and scaling needs
## What This Skill Does

- **Query Metrics**: Execute PromQL queries against Prometheus
- **Resource Usage**: Check CPU, memory, disk usage
- **Service Health**: Verify service metrics and availability
- **Performance Analysis**: Analyze request rates, latency, errors
- **Capacity Planning**: Review resource trends
## Examples

### Access Prometheus UI

**Prometheus UI**: Port-forward to access locally

```bash
kubectl port-forward -n observability svc/prometheus 9090:9090 &
# Open http://localhost:9090
```

**Grafana Explore**: https://grafana.localtest.me:9443/explore

- Select Prometheus datasource
- Enter PromQL queries
### Query Metrics via CLI

```bash
# Basic query
kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query' \
  --data-urlencode 'query=up' | python3 -m json.tool

# Query with time range
# Note: `date -u -v-1H` is BSD/macOS syntax; on GNU/Linux use $(date -u -d '1 hour ago' +%s)
kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query_range' \
  --data-urlencode 'query=rate(container_cpu_usage_seconds_total[5m])' \
  --data-urlencode 'start='$(date -u -v-1H +%s) \
  --data-urlencode 'end='$(date -u +%s) \
  --data-urlencode 'step=60' | python3 -m json.tool
```
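If you run ad-hoc queries frequently, a small shell helper can wrap the pattern above. This is a sketch, not part of the platform; the `promq` name is hypothetical and it assumes the same `observability` namespace and Prometheus service address used throughout this skill.

```bash
# Hypothetical helper: run an instant PromQL query through the Grafana pod.
promq() {
  kubectl exec -n observability deployment/grafana -- \
    curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query' \
    --data-urlencode "query=$1" | python3 -m json.tool
}

# Usage:
promq 'up'
promq 'sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace)'
```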
## Common PromQL Queries

### Service Health

```promql
# Check if services are up
up{job="kubernetes-pods"}

# Count running pods by namespace
count by (kubernetes_namespace) (up == 1)

# Check deployment replicas
kube_deployment_status_replicas_available

# Check StatefulSet replicas
kube_statefulset_status_replicas_ready
```
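To spot deployments that are not at their desired scale, compare the two replica series. A sketch, assuming kube-state-metrics is installed (it provides the `kube_*` metrics used in this section):

```bash
# Deployments whose available replicas differ from the desired replica count
kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query' \
  --data-urlencode 'query=kube_deployment_spec_replicas != kube_deployment_status_replicas_available' \
  | python3 -m json.tool
```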
### CPU Usage

```promql
# Pod CPU usage (percentage of limit)
sum(rate(container_cpu_usage_seconds_total{container!="",container!="POD"}[5m])) by (namespace, pod, container)
  / sum(container_spec_cpu_quota{container!="",container!="POD"} / container_spec_cpu_period{container!="",container!="POD"}) by (namespace, pod, container) * 100

# Node CPU usage
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Top CPU consuming pods
topk(10,
  sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (namespace, pod)
)
```
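High CPU percentages often come with CFS throttling. A sketch of a throttling check, assuming cAdvisor's `container_cpu_cfs_*` counters are scraped (they normally come from the same kubelet/cAdvisor endpoint as `container_cpu_usage_seconds_total`):

```bash
# Fraction of CFS periods in which a pod was throttled (values near 1 mean heavy throttling)
kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query' \
  --data-urlencode 'query=sum by (namespace, pod) (rate(container_cpu_cfs_throttled_periods_total[5m])) / sum by (namespace, pod) (rate(container_cpu_cfs_periods_total[5m]))' \
  | python3 -m json.tool
```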
### Memory Usage

```promql
# Pod memory usage (percentage of limit)
sum(container_memory_working_set_bytes{container!="",container!="POD"}) by (namespace, pod, container)
  / sum(container_spec_memory_limit_bytes{container!="",container!="POD"}) by (namespace, pod, container) * 100

# Pod memory usage in bytes
container_memory_working_set_bytes{container!="",container!="POD"}

# Top memory consuming pods
topk(10,
  sum(container_memory_working_set_bytes{container!=""}) by (namespace, pod)
)
```
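Containers that hit their memory limit are usually OOM-killed rather than showing sustained 100% usage, so it is worth checking for that as well. A sketch, assuming kube-state-metrics is installed:

```bash
# Containers whose last termination reason was OOMKilled
kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query' \
  --data-urlencode 'query=kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1' \
  | python3 -m json.tool
```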
### Network Traffic

```promql
# Network receive rate
rate(container_network_receive_bytes_total[5m])

# Network transmit rate
rate(container_network_transmit_bytes_total[5m])

# Total network I/O by pod
sum by (pod) (
  rate(container_network_receive_bytes_total[5m]) +
  rate(container_network_transmit_bytes_total[5m])
)
```
### Disk Usage

```promql
# Filesystem usage percentage
(kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) * 100

# PVC usage by namespace
sum by (namespace, persistentvolumeclaim) (
  kubelet_volume_stats_used_bytes
)

# Disk I/O rate
rate(container_fs_writes_bytes_total[5m])
```
### Pod Status

```promql
# Pods not running (kube_pod_status_phase is 0/1 per phase, so filter on == 1)
kube_pod_status_phase{phase!="Running"} == 1

# Pod restart count
kube_pod_container_status_restarts_total

# Pods waiting (pending)
kube_pod_status_phase{phase="Pending"} == 1

# Pods in crash loop
kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} == 1
```
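The restart counter above is cumulative, so it is often more useful to ask which containers restarted recently. A sketch using that same counter:

```bash
# Containers that restarted in the last hour
kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query' \
  --data-urlencode 'query=increase(kube_pod_container_status_restarts_total[1h]) > 0' \
  | python3 -m json.tool
```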
### Request Metrics (if instrumented)

```promql
# Request rate
rate(http_requests_total[5m])

# Error rate
rate(http_requests_total{status=~"5.."}[5m])

# Request latency (p95)
histogram_quantile(0.95,
  rate(http_request_duration_seconds_bucket[5m])
)
```
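To express errors as a percentage of traffic rather than an absolute rate, divide the two rates above. This assumes your services expose `http_requests_total` with a `status` label as in the examples; adjust the metric and label names to your instrumentation:

```bash
# Percentage of requests returning 5xx over the last 5 minutes
kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100' \
  | python3 -m json.tool
```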
## Check Specific Components

### Prometheus Metrics

```bash
# Check Prometheus scrape targets
kubectl exec -n observability deployment/grafana -- \
  curl -s 'http://prometheus.observability.svc:9090/api/v1/targets' | python3 -m json.tool

# Prometheus storage size
kubectl exec -n observability deployment/grafana -- \
  curl -s 'http://prometheus.observability.svc:9090/api/v1/status/tsdb' | python3 -m json.tool
```
### Grafana Metrics

```promql
# Grafana datasource queries
grafana_datasource_request_total

# Grafana dashboard loads
grafana_page_response_status_total
```
### Keycloak Metrics (if exposed)

```promql
# Keycloak sessions
keycloak_sessions

# Keycloak login failures
keycloak_failed_login_attempts
```
### Istio Metrics

```promql
# Istio requests
istio_requests_total

# Istio request duration
histogram_quantile(0.95,
  rate(istio_request_duration_milliseconds_bucket[5m])
)

# Istio error rate
rate(istio_requests_total{response_code=~"5.."}[5m])
```
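To see error ratios per service rather than a raw rate, group the series above by the standard Istio telemetry labels. A sketch, assuming the default labels (`destination_service_name`, `response_code`) are present:

```bash
# Per-service success rate over the last 5 minutes
kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query' \
  --data-urlencode 'query=sum by (destination_service_name) (rate(istio_requests_total{response_code!~"5.."}[5m])) / sum by (destination_service_name) (rate(istio_requests_total[5m]))' \
  | python3 -m json.tool
```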
## Resource Monitoring via kubectl

### Quick Resource Check

```bash
# Node resources
kubectl top nodes

# Pod resources (all namespaces)
kubectl top pods -A --sort-by=memory

# Pod resources (specific namespace)
kubectl top pods -n observability --sort-by=cpu

# Container resources in pod
kubectl top pod <pod-name> -n <namespace> --containers
```
### Resource Limits and Requests

```bash
# Show resource requests/limits for a deployment
kubectl describe deployment <name> -n <namespace> | grep -A 5 -E "Limits|Requests"

# Show all pod resource requests
kubectl get pods -n <namespace> -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].resources}{"\n"}{end}'
```
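If kube-state-metrics v2+ is installed, the same information is also queryable in Prometheus via `kube_pod_container_resource_requests` and `kube_pod_container_resource_limits`, which carry a `resource` label (`cpu`, `memory`). A sketch:

```bash
# Total requested CPU and memory per namespace (requires kube-state-metrics v2+)
kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query' \
  --data-urlencode 'query=sum by (namespace, resource) (kube_pod_container_resource_requests)' \
  | python3 -m json.tool
```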
## Grafana Dashboards

**Access**: https://grafana.localtest.me:9443/dashboards

**Key Dashboards**:

- **Kubernetes / Compute Resources / Cluster** - Overall cluster metrics
- **Kubernetes / Compute Resources / Namespace (Pods)** - Per-namespace pod resources
- **Kubernetes / Compute Resources / Pod** - Individual pod metrics
- **Prometheus** - Prometheus self-monitoring
- **Loki Logs** - Log volume and patterns
- **Istio Mesh** - Service mesh metrics
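To confirm which dashboards are actually provisioned, you can query the Grafana search API from inside the Grafana pod. A sketch, reusing the admin credentials shown in the alert-rule example below; adjust if yours differ:

```bash
# List provisioned dashboards (title, folder, UID)
kubectl exec -n observability deployment/grafana -- \
  curl -s 'http://localhost:3000/api/search?type=dash-db' -u admin:admin123 | python3 -m json.tool
```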
### Create Custom Queries in Grafana

1. Navigate to Explore (compass icon in sidebar)
2. Select Prometheus datasource
3. Enter PromQL query
4. Click "Run query"
5. Optionally save to dashboard
## Troubleshooting with Metrics

### Issue: High CPU Usage

```promql
# Find pods using >80% CPU
sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace, pod, container)
  / sum(container_spec_cpu_quota / container_spec_cpu_period) by (namespace, pod, container) * 100 > 80
```
### Issue: High Memory Usage

```promql
# Find pods using >80% memory
sum(container_memory_working_set_bytes) by (namespace, pod, container)
  / sum(container_spec_memory_limit_bytes) by (namespace, pod, container) * 100 > 80
```
### Issue: Service Not Responding

```promql
# Check if service endpoints are up
up{job="kubernetes-service-endpoints"}

# Check scrape failures
up == 0
```
### Issue: Disk Full

```promql
# Find PVCs >80% full
(kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) * 100 > 80
```
## Alert Query Testing

When investigating alerts, test the PromQL query:

```bash
# Get alert query from Grafana
kubectl exec -n observability deployment/grafana -- \
  curl -s 'http://localhost:3000/api/v1/provisioning/alert-rules' \
  -u admin:admin123 | python3 -c "
import sys, json
rules = json.load(sys.stdin)
alert_uid = 'prometheus-down'  # Change this
rule = next((r for r in rules if r.get('uid') == alert_uid), None)
if rule:
    query = rule['data'][0]['model']['expr']
    print(f'Query: {query}')
"

# Test the query
kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query' \
  --data-urlencode "query=<QUERY_FROM_ABOVE>" | python3 -m json.tool
```
## Metrics Collection Issues

### Check if Metrics Are Being Scraped

```promql
# Check last scrape time
time() - timestamp(up)

# Check scrape duration
scrape_duration_seconds
```
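If a scrape looks stale, the targets API shown earlier reports per-target health and the last scrape error. A sketch that filters the response down to unhealthy targets:

```bash
# Show scrape targets that are not healthy, with their last error
kubectl exec -n observability deployment/grafana -- \
  curl -s 'http://prometheus.observability.svc:9090/api/v1/targets' | python3 -c "
import sys, json
targets = json.load(sys.stdin)['data']['activeTargets']
for t in targets:
    if t.get('health') != 'up':
        print(t.get('scrapeUrl'), '-', t.get('lastError'))
"
```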
### Verify Metric Exists

```bash
# List all metrics
kubectl exec -n observability deployment/grafana -- \
  curl -s 'http://prometheus.observability.svc:9090/api/v1/label/__name__/values' | python3 -m json.tool

# Search for a specific metric (pretty-print first so grep filters line by line)
kubectl exec -n observability deployment/grafana -- \
  curl -s 'http://prometheus.observability.svc:9090/api/v1/label/__name__/values' | python3 -m json.tool | grep "your_metric"
```
## Related Documentation
## Pro Tips

- **Use `rate()` for counters**: `rate(metric[5m])` instead of raw counter values
- **Aggregate with `by`/`without`**: `sum by (namespace) (metric)` to group metrics
- **Use recording rules**: For frequently used complex queries
- **Set appropriate time ranges**: Use `[5m]` for rate calculations
- **Test queries in Explore first**: Before adding to dashboards or alerts