
name: check-metrics
description: Query Prometheus metrics, check resource usage, and analyze platform performance in the Kagenti platform

Check Metrics Skill

This skill helps you query Prometheus metrics and analyze platform performance.

When to Use

  • User asks about resource usage (CPU, memory, disk)
  • Investigating performance issues
  • Checking service health metrics
  • After deployments to verify metrics collection
  • Analyzing platform capacity and scaling needs

What This Skill Does

  1. Query Metrics: Execute PromQL queries against Prometheus
  2. Resource Usage: Check CPU, memory, disk usage
  3. Service Health: Verify service metrics and availability
  4. Performance Analysis: Analyze request rates, latency, errors
  5. Capacity Planning: Review resource trends

Examples

Access Prometheus UI

Prometheus UI: Port-forward to access locally

kubectl port-forward -n observability svc/prometheus 9090:9090 &
# Open http://localhost:9090
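
With the port-forward running, Prometheus' HTTP API is also reachable on localhost, which is handy for quick scripted checks without the UI (a minimal sketch; assumes the port-forward above is still active):

# Instant query against the port-forwarded Prometheus
curl -s -G 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=up' | python3 -m json.tool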

Grafana Explore: https://grafana.localtest.me:9443/explore

  • Select Prometheus datasource
  • Enter PromQL queries

Query Metrics via CLI

# Basic query
kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query' \
  --data-urlencode 'query=up' | python3 -m json.tool

# Query with time range (last hour)
# Note: -v-1H is BSD/macOS date syntax; on GNU/Linux use: date -u -d '1 hour ago' +%s
kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query_range' \
  --data-urlencode 'query=rate(container_cpu_usage_seconds_total[5m])' \
  --data-urlencode 'start='$(date -u -v-1H +%s) \
  --data-urlencode 'end='$(date -u +%s) \
  --data-urlencode 'step=60' | python3 -m json.tool
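
The raw API responses are verbose. A minimal sketch for extracting just the label values and sample values from an instant-query result, reusing the exec-through-Grafana pattern above (the job and instance labels are examples; which labels exist depends on the metric):

kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query' \
  --data-urlencode 'query=up' | python3 -c "
import sys, json
resp = json.load(sys.stdin)
# Instant queries return a vector: each entry has a label set and a [timestamp, value] pair
for r in resp['data']['result']:
    print(r['metric'].get('job', '?'), r['metric'].get('instance', '?'), r['value'][1])
"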

Common PromQL Queries

Service Health

# Check if services are up
up{job="kubernetes-pods"}

# Count healthy scrape targets by namespace
count by (kubernetes_namespace) (up == 1)

# Check deployment replicas
kube_deployment_status_replicas_available

# Check StatefulSet replicas
kube_statefulset_status_replicas_ready

CPU Usage

# Pod CPU usage (percentage of limit)
sum(rate(container_cpu_usage_seconds_total{container!="",container!="POD"}[5m])) by (namespace, pod, container)
/ sum(container_spec_cpu_quota{container!="",container!="POD"} / container_spec_cpu_period{container!="",container!="POD"}) by (namespace, pod, container) * 100

# Node CPU usage
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Top CPU consuming pods
topk(10,
  sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (namespace, pod)
)

Memory Usage

# Pod memory usage (percentage of limit)
sum(container_memory_working_set_bytes{container!="",container!="POD"}) by (namespace, pod, container)
/ sum(container_spec_memory_limit_bytes{container!="",container!="POD"}) by (namespace, pod, container) * 100

# Pod memory usage in bytes
container_memory_working_set_bytes{container!="",container!="POD"}

# Top memory consuming pods
topk(10,
  sum(container_memory_working_set_bytes{container!=""}) by (namespace, pod)
)

Network Traffic

# Network receive rate
rate(container_network_receive_bytes_total[5m])

# Network transmit rate
rate(container_network_transmit_bytes_total[5m])

# Total network I/O by pod
sum by (pod) (
  rate(container_network_receive_bytes_total[5m]) +
  rate(container_network_transmit_bytes_total[5m])
)

Disk Usage

# Filesystem usage percentage
(kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) * 100

# PVC usage by namespace
sum by (namespace, persistentvolumeclaim) (
  kubelet_volume_stats_used_bytes
)

# Disk I/O rate
rate(container_fs_writes_bytes_total[5m])

Pod Status

# Pods not running (== 1 drops the zero-valued phase series)
kube_pod_status_phase{phase!="Running"} == 1

# Pod restart count
kube_pod_container_status_restarts_total

# Pods waiting (pending)
kube_pod_status_phase{phase="Pending"} == 1

# Pods in crash loop
kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} == 1

Request Metrics (if instrumented)

# Request rate
rate(http_requests_total[5m])

# Error rate
rate(http_requests_total{status=~"5.."}[5m])

# Request latency (p95)
histogram_quantile(0.95,
  rate(http_request_duration_seconds_bucket[5m])
)
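
To express errors as a share of traffic rather than an absolute rate, divide the two counters (assumes the service exposes http_requests_total with a status label, as in the queries above):

# Error percentage over the last 5 minutes
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])) * 100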

Check Specific Components

Prometheus Metrics

# Check Prometheus scrape targets
kubectl exec -n observability deployment/grafana -- \
  curl -s 'http://prometheus.observability.svc:9090/api/v1/targets' | python3 -m json.tool

# Prometheus storage size
kubectl exec -n observability deployment/grafana -- \
  curl -s 'http://prometheus.observability.svc:9090/api/v1/status/tsdb' | python3 -m json.tool

Grafana Metrics

# Grafana datasource queries
grafana_datasource_request_total

# Grafana dashboard loads
grafana_page_response_status_total

Keycloak Metrics (if exposed)

# Keycloak sessions
keycloak_sessions

# Keycloak login failures
keycloak_failed_login_attempts

Istio Metrics

# Istio requests
istio_requests_total

# Istio request duration
histogram_quantile(0.95,
  rate(istio_request_duration_milliseconds_bucket[5m])
)

# Istio error rate
rate(istio_requests_total{response_code=~"5.."}[5m])
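
To see which services are actually producing the errors, group the same counters by Istio's standard destination labels (a sketch; assumes default Istio telemetry labels):

# Error percentage per destination service
sum(rate(istio_requests_total{response_code=~"5.."}[5m])) by (destination_service_name)
/ sum(rate(istio_requests_total[5m])) by (destination_service_name) * 100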

Resource Monitoring via kubectl

Quick Resource Check

# Node resources
kubectl top nodes

# Pod resources (all namespaces)
kubectl top pods -A --sort-by=memory

# Pod resources (specific namespace)
kubectl top pods -n observability --sort-by=cpu

# Container resources in pod
kubectl top pod <pod-name> -n <namespace> --containers

Resource Limits and Requests

# Show resource requests/limits for deployment
kubectl describe deployment <name> -n <namespace> | grep -A 5 "Limits\|Requests"

# Show all pod resource requests
kubectl get pods -n <namespace> -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].resources}{"\n"}{end}'
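
To compare what pods actually use against what they request, kube-state-metrics exposes requests as a metric. A sketch assuming kube-state-metrics v2.x metric names (kube_pod_container_resource_requests with a resource label; older versions used kube_pod_container_resource_requests_cpu_cores instead):

# CPU usage as a percentage of requested CPU, per pod
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (namespace, pod)
/ sum(kube_pod_container_resource_requests{resource="cpu"}) by (namespace, pod) * 100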

Grafana Dashboards

Access: https://grafana.localtest.me:9443/dashboards

Key Dashboards:

  1. Kubernetes / Compute Resources / Cluster - Overall cluster metrics
  2. Kubernetes / Compute Resources / Namespace (Pods) - Per-namespace pod resources
  3. Kubernetes / Compute Resources / Pod - Individual pod metrics
  4. Prometheus - Prometheus self-monitoring
  5. Loki Logs - Log volume and patterns
  6. Istio Mesh - Service mesh metrics

Create Custom Queries in Grafana

  1. Navigate to Explore (compass icon in sidebar)
  2. Select Prometheus datasource
  3. Enter PromQL query
  4. Click "Run query"
  5. Optionally save to dashboard

Troubleshooting with Metrics

Issue: High CPU Usage

# Find pods using >80% CPU
sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace, pod, container)
/ sum(container_spec_cpu_quota / container_spec_cpu_period) by (namespace, pod, container) * 100 > 80

Issue: High Memory Usage

# Find pods using >80% memory
sum(container_memory_working_set_bytes) by (namespace, pod, container)
/ sum(container_spec_memory_limit_bytes) by (namespace, pod, container) * 100 > 80

Issue: Service Not Responding

# Check if service endpoints are up
up{job="kubernetes-service-endpoints"}

# Check scrape failures
up == 0

Issue: Disk Full

# Find PVCs >80% full
(kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) * 100 > 80
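
To catch volumes before they fill up, predict_linear can extrapolate the recent trend (a sketch: flags PVCs projected to run out of space within 4 days based on the last 6 hours of data):

# PVCs projected to be full within 4 days
predict_linear(kubelet_volume_stats_available_bytes[6h], 4 * 24 * 3600) < 0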

Alert Query Testing

When investigating a firing alert, extract its PromQL expression from Grafana's provisioning API and test it directly against Prometheus:

# Get alert query from Grafana
kubectl exec -n observability deployment/grafana -- \
  curl -s 'http://localhost:3000/api/v1/provisioning/alert-rules' \
  -u admin:admin123 | python3 -c "
import sys, json
rules = json.load(sys.stdin)
alert_uid = 'prometheus-down'  # Change this
rule = next((r for r in rules if r.get('uid') == alert_uid), None)
if rule:
    query = rule['data'][0]['model']['expr']
    print(f'Query: {query}')
"

# Test the query
kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query' \
  --data-urlencode "query=<QUERY_FROM_ABOVE>" | python3 -m json.tool

Metrics Collection Issues

Check if Metrics Are Being Scraped

# Check last scrape time
time() - timestamp(up)

# Check scrape duration
scrape_duration_seconds
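
To list only the targets that are currently failing, filter the targets endpoint client-side (a sketch; field names follow Prometheus' /api/v1/targets response format):

kubectl exec -n observability deployment/grafana -- \
  curl -s 'http://prometheus.observability.svc:9090/api/v1/targets' | python3 -c "
import sys, json
# activeTargets carries health, the scrape URL, and the last scrape error for each target
for t in json.load(sys.stdin)['data']['activeTargets']:
    if t['health'] != 'up':
        print(t['scrapePool'], t['scrapeUrl'], t['lastError'])
"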

Verify Metric Exists

# List all metrics
kubectl exec -n observability deployment/grafana -- \
  curl -s 'http://prometheus.observability.svc:9090/api/v1/label/__name__/values' | python3 -m json.tool

# Search for specific metric
kubectl exec -n observability deployment/grafana -- \
  curl -s 'http://prometheus.observability.svc:9090/api/v1/label/__name__/values' | grep "your_metric"
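
Once a metric name turns up, the series endpoint shows which label combinations it has (a sketch using Prometheus' /api/v1/series API; replace your_metric with the actual name):

kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://prometheus.observability.svc:9090/api/v1/series' \
  --data-urlencode 'match[]={__name__=~"your_metric.*"}' | python3 -m json.tool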


Pro Tips

  1. Use rate() for counters: rate(metric[5m]) instead of raw counter values
  2. Aggregate with by/without: sum by (namespace) (metric) to group metrics
  3. Use recording rules: For frequently used complex queries (a sketch follows this list)
  4. Set appropriate time ranges: Use a window of at least a few scrape intervals (e.g. [5m]) for rate calculations
  5. Test queries in Explore first: Before adding to dashboards or alerts
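
For tip 3, a minimal sketch of a Prometheus recording rule that precomputes the per-pod CPU rate used throughout this skill (the group and rule names are examples; how rule files are loaded depends on how Prometheus is deployed in the platform):

groups:
  - name: kagenti-recording-rules
    rules:
      # Precompute per-pod CPU usage so dashboards and alerts can reference it cheaply
      - record: namespace_pod:container_cpu_usage_seconds_total:rate5m
        expr: 'sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (namespace, pod)'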

🤖 Generated with Claude Code