aggregating-gauge-metrics

@rustomax/observe-community-mcp

Aggregate pre-computed metrics (gauge, counter, delta types) using OPAL. Use when analyzing request counts, error rates, resource utilization, or any numeric metrics over time. Covers align + m() + aggregate pattern, summary vs time-series output, and common aggregation functions. For percentile metrics (tdigest), see analyzing-tdigest-metrics skill.

SKILL.md

Aggregating Gauge Metrics

Pre-computed metrics in Observe store aggregated measurements at regular intervals (typically every 5 minutes). This skill teaches how to query gauge, counter, and delta metric types using OPAL.

When to Use This Skill

  • Analyzing request counts, error rates, or throughput metrics
  • Tracking resource utilization (CPU, memory, network)
  • Computing totals, averages, or rates across time periods
  • Creating dashboards with time-series charts
  • Working with any gauge, counter, or delta metric type
  • Producing summary statistics or trends over time

Prerequisites

  • Access to an Observe tenant via MCP
  • Understanding that metrics are pre-aggregated (not raw events)
  • A metric dataset with type gauge, counter, or delta
  • Familiarity with discover_context() for finding and inspecting metrics

Key Concepts

What Are Gauge Metrics?

Gauge metrics are pre-aggregated numeric measurements collected at regular intervals:

Pre-aggregated: Already summarized at collection time (typically 5-minute intervals)

  • More efficient than querying raw data
  • Faster query performance
  • Lower storage costs

Common Metric Types:

  • Gauge: Point-in-time value (CPU utilization, memory usage, queue depth)
  • Counter: Monotonically increasing value (total requests, bytes sent)
  • Delta: Change between intervals (requests per interval, errors per interval)
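To make the three types concrete, here is a minimal sketch in plain Python (not OPAL) with made-up sample values. It assumes the counter never resets inside the window; counter_to_deltas is a hypothetical helper, not part of Observe.

```python
from typing import List


def counter_to_deltas(counter_samples: List[int]) -> List[int]:
    """Turn cumulative counter readings into per-interval deltas.

    Assumes the counter never resets inside the window.
    """
    return [curr - prev for prev, curr in zip(counter_samples, counter_samples[1:])]


# Hypothetical readings taken once per interval.
counter = [100, 130, 175, 175, 240]   # counter: total requests so far
deltas = counter_to_deltas(counter)   # delta: requests in each interval
gauge = [0.42, 0.55, 0.38, 0.61]      # gauge: CPU ratio at each reading

print(deltas)                                    # [30, 45, 0, 65]
print(sum(deltas) == counter[-1] - counter[0])   # True
```

Summing the deltas recovers the growth of the counter over the window, while a gauge reading is meaningful on its own.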

Examples:

  • span_call_count_5m - Number of requests per 5-minute interval
  • span_error_count_5m - Number of errors per 5-minute interval
  • system_cpu_utilization_ratio - CPU utilization as a ratio
  • k8s_pod_memory_available_bytes - Available memory in bytes

CRITICAL: The align Verb is REQUIRED

Unlike datasets (Events/Intervals), metrics MUST use the align verb:

# WRONG - Will not work ❌
m("span_call_count_5m")
| statsby total:sum(metric)

# CORRECT - Must use align ✅
align options(bins: 1), rate:sum(m("span_call_count_5m"))
aggregate total_requests:sum(rate)

Why align is required: Metrics are stored as time-series data that must be aligned to a time grid before aggregation.
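The idea of a time grid can be illustrated outside OPAL. The sketch below is a simplified Python model, not Observe's implementation: it snaps hypothetical (timestamp, value) samples to fixed-width buckets and sums each bucket. align_sum and the sample data are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def align_sum(samples: List[Tuple[int, float]], bucket_seconds: int) -> Dict[int, float]:
    """Group (timestamp, value) samples into fixed buckets and sum each bucket."""
    buckets: Dict[int, float] = defaultdict(float)
    for ts, value in samples:
        bucket_start = ts - (ts % bucket_seconds)  # snap to the grid
        buckets[bucket_start] += value
    return dict(sorted(buckets.items()))


# Hypothetical samples: (seconds since start of window, value)
samples = [(10, 4.0), (290, 6.0), (305, 1.0), (610, 2.0), (899, 3.0)]

print(align_sum(samples, 300))   # {0: 10.0, 300: 1.0, 600: 5.0}
print(align_sum(samples, 900))   # {0: 16.0}
```

With one bucket covering the whole window, the result collapses to a single value, which is the intuition behind options(bins: 1).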

Summary vs Time-Series Output

OPAL metrics queries can produce two different output types:

Output Type   Pattern              Result                Use Case
Summary       options(bins: 1)     One row per group     Totals, overall statistics
Time-Series   5m, 1h, or default   Many rows per group   Trending, dashboards, charts

Summary pattern - Single statistics across entire time range:

align options(bins: 1), rate:sum(m("metric"))
aggregate total:sum(rate), group_by(service_name)

Output: One row per service

Time-series pattern - Values over time buckets:

align 5m, rate:sum(m("metric"))
| aggregate total:sum(rate), group_by(service_name)

Output: Multiple rows per service (one per 5-minute bucket)

CRITICAL Syntax Difference:

  • Summary (bins: 1): NO pipe | between align and aggregate
  • Time-series (5m): YES pipe | between align and aggregate
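When queries are generated programmatically, the pipe rule is easy to get wrong. The following is a small Python sketch that assembles query text according to the rule as stated in this document. build_metric_query is a hypothetical helper, and it only emits the align and aggregate forms shown above.

```python
from typing import Optional


def build_metric_query(metric: str, group_by: Optional[str] = None,
                       bucket: Optional[str] = None) -> str:
    """Assemble an align + aggregate query following this document's pipe rule.

    bucket=None  -> summary (options(bins: 1)), no pipe before aggregate
    bucket="5m"  -> time-series, pipe before aggregate
    """
    grid = bucket if bucket else "options(bins: 1)"
    pipe = "| " if bucket else ""
    aggregate = "aggregate total:sum(rate)"
    if group_by:
        aggregate += f", group_by({group_by})"
    return f'align {grid}, rate:sum(m("{metric}"))\n{pipe}{aggregate}'


print(build_metric_query("span_call_count_5m", group_by="service_name"))
print(build_metric_query("span_call_count_5m", group_by="service_name", bucket="5m"))
```

The first call prints the summary form without a pipe; the second prints the 5-minute time-series form with one.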

Discovery Workflow

Step 1: Search for metrics

discover_context("request count", result_type="metric")
discover_context("error", result_type="metric")
discover_context("cpu memory", result_type="metric")

Step 2: Get detailed metric schema

discover_context(metric_name="span_call_count_5m")

Step 3: Verify metric type

Look for: Type: gauge (or counter, delta)

Step 4: Note available dimensions

These are used for group_by():

  • service_name, service_namespace
  • environment, span_name
  • k8s_namespace_name, k8s_pod_name
  • etc. (shown in discovery output)

Step 5: Write query

Use the align + m() + aggregate pattern with the correct dimensions.

Basic Patterns

Pattern 1: Total Count Across Time Range

Get overall totals (summary output):

align options(bins: 1), rate:sum(m("span_call_count_5m"))
aggregate total_requests:sum(rate)

Output: Single row with total count across entire time range.

No group_by: Aggregates everything together.

Pattern 2: Totals Per Group

Get totals broken down by dimension:

align options(bins: 1), rate:sum(m("span_call_count_5m"))
aggregate total_requests:sum(rate), group_by(service_name)

Output: One row per service with total requests.

group_by: Use any dimension from metric schema.

Pattern 3: Average Values Per Group

Calculate averages across time range:

align options(bins: 1), cpu:avg(m("system_cpu_utilization_ratio"))
aggregate avg_cpu:avg(cpu), group_by(service_name)

Output: Average CPU utilization per service.

avg() function: Used twice - once in align, once in aggregate.
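Averaging in two stages is not the same as averaging every raw sample once. The Python sketch below is a simplified model, not OPAL semantics verified against Observe: it assumes the first stage produces one value per series and the second stage combines those values per group. The pod names and samples are hypothetical.

```python
from statistics import mean
from typing import Dict, List

# Hypothetical CPU ratio samples per pod, all belonging to one service.
samples_by_pod: Dict[str, List[float]] = {
    "pod-a": [0.2, 0.4],
    "pod-b": [0.9, 0.9, 0.9, 0.9],
}

# Stage 1 (role of align): one averaged value per series.
per_pod = {pod: mean(values) for pod, values in samples_by_pod.items()}

# Stage 2 (role of aggregate): average those values across the group.
avg_of_avgs = mean(per_pod.values())

# For comparison: a single average over every raw sample.
pooled = mean(v for values in samples_by_pod.values() for v in values)

print(round(avg_of_avgs, 3), round(pooled, 3))   # 0.6 0.7
```

In this model the two-stage result weights each series equally, regardless of how many samples it reported. That is worth keeping in mind when series report at different rates.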

Pattern 4: Multiple Aggregations

Compute several statistics together:

align options(bins: 1), rate:sum(m("span_call_count_5m"))
aggregate total:sum(rate),
          average:avg(rate),
          maximum:max(rate),
          group_by(service_name)

Output: Multiple columns per service (total, average, maximum).

Pattern 5: Time-Series for Trending

Track values over time buckets:

align 5m, rate:sum(m("span_call_count_5m"))
| aggregate requests_per_5min:sum(rate), group_by(service_name)

Output: Multiple rows per service (one per 5-minute interval).

Note: A pipe | is required after align in the time-series pattern.

Output columns:

  • _c_bucket - Time bucket identifier
  • valid_from, valid_to - Bucket boundaries
  • Metric values

Common Use Cases

Counting Total Requests by Service

align options(bins: 1), rate:sum(m("span_call_count_5m"))
aggregate total_requests:sum(rate), group_by(service_name)
| sort desc(total_requests)
| limit 10

Use case: Identify top services by request volume.
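For readers less familiar with sort and limit, this is roughly what the last two steps do to the aggregated rows, modeled in Python. The row values are taken from the sample results in the Complete Example; top_n is a hypothetical helper.

```python
# Rows as they might come back from the aggregate step
# (values taken from the sample results in the Complete Example).
rows = [
    {"service_name": "featureflagservice", "total_requests": 11693},
    {"service_name": "frontend", "total_requests": 15263},
    {"service_name": "productcatalogservice", "total_requests": 8813},
    {"service_name": "frontend-proxy", "total_requests": 15660},
]


def top_n(rows, column, n):
    """Model of `sort desc(column)` followed by `limit n`."""
    return sorted(rows, key=lambda r: r[column], reverse=True)[:n]


for row in top_n(rows, "total_requests", 2):
    print(row["service_name"], row["total_requests"])
# frontend-proxy 15660
# frontend 15263
```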

Counting Errors with Fill for Zero Values

align options(bins: 1), errors:sum(m("span_error_count_5m"))
aggregate total_errors:sum(errors), group_by(service_name)
fill total_errors:0

Use case: Show all services, even those with zero errors.

fill verb: Replaces null values with 0.
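As a rough Python model of that behavior (an illustration only, not how Observe implements it), null replacement on a result set looks like this. The fill function and the rows are hypothetical.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[object]]


def fill(rows: List[Row], column: str, default: object) -> List[Row]:
    """Return copies of rows with None in `column` replaced by `default`."""
    return [
        {**row, column: default if row.get(column) is None else row[column]}
        for row in rows
    ]


# Hypothetical aggregate output: "cart" reported no error samples.
rows = [
    {"service_name": "frontend", "total_errors": 35},
    {"service_name": "cart", "total_errors": None},
]

print(fill(rows, "total_errors", 0))
# [{'service_name': 'frontend', 'total_errors': 35}, {'service_name': 'cart', 'total_errors': 0}]
```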

Tracking Request Rate Over Time

align 1h, rate:sum(m("span_call_count_5m"))
| aggregate requests_per_hour:sum(rate), group_by(service_name)

Use case: Hourly request trends for dashboards.

Output: Time-series data for charting.

Multiple Metrics in One Query

align options(bins: 1),
      requests:sum(m("span_call_count_5m")),
      errors:sum(m("span_error_count_5m"))
aggregate total_requests:sum(requests),
          total_errors:sum(errors),
          group_by(service_name)
| make_col error_rate:float64(total_errors) / float64(total_requests)

Use case: Calculate error rate from two metrics.

make_col: Add derived column after aggregation.

Resource Utilization Averages

align options(bins: 1), cpu:avg(m("system_cpu_utilization_ratio"))
aggregate avg_cpu:avg(cpu),
          max_cpu:max(cpu),
          group_by(k8s_pod_name)
| sort desc(avg_cpu)
| limit 20

Use case: Find pods with highest CPU usage.

Complete Example

Scenario: You want to analyze request and error rates for your microservices over the last 24 hours.

Step 1: Discover available metrics

discover_context("request error", result_type="metric")

Found metrics:

  • span_call_count_5m (type: gauge)
  • span_error_count_5m (type: gauge)

Step 2: Get metric details

discover_context(metric_name="span_call_count_5m")

Available dimensions: service_name, service_namespace, environment, span_name

Step 3: Query for summary statistics

align options(bins: 1),
      requests:sum(m("span_call_count_5m")),
      errors:sum(m("span_error_count_5m"))
aggregate total_requests:sum(requests),
          total_errors:sum(errors),
          group_by(service_name)
fill total_errors:0
| make_col error_rate:float64(total_errors) / float64(total_requests) * 100.0
| sort desc(total_requests)

Step 4: Interpret results

service_name            total_requests   total_errors   error_rate
frontend-proxy          15660            0              0.0
frontend                15263            35             0.23
featureflagservice      11693            0              0.0
productcatalogservice   8813             0              0.0

Insight: Frontend has a 0.23% error rate - investigate errors.
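The error_rate column can be checked by hand. The Python sketch below repeats the make_col arithmetic for two rows of the table. The zero-request guard is an addition of this sketch and is not part of the OPAL query above.

```python
def error_rate_percent(total_errors: int, total_requests: int) -> float:
    """Error rate as a percentage; returns 0.0 when there were no requests."""
    if total_requests == 0:
        return 0.0
    return float(total_errors) / float(total_requests) * 100.0


# Values from the results table above.
print(round(error_rate_percent(35, 15263), 2))   # 0.23
print(round(error_rate_percent(0, 15660), 2))    # 0.0
```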

Step 5: Get hourly trends

align 1h,
      requests:sum(m("span_call_count_5m")),
      errors:sum(m("span_error_count_5m"))
| aggregate requests_per_hour:sum(requests),
            errors_per_hour:sum(errors),
            group_by(service_name)
| filter service_name = "frontend"

Output: Time-series showing frontend requests and errors per hour.

Common Pitfalls

Pitfall 1: Forgetting align Verb

Wrong:

m("span_call_count_5m")
| statsby total:sum(metric)

Correct:

align options(bins: 1), rate:sum(m("span_call_count_5m"))
aggregate total:sum(rate)

Why: Metrics MUST use the align verb; it is required, not optional.

Pitfall 2: Wrong Pipe Usage

Wrong (pipe with bins:1):

align options(bins: 1), rate:sum(m("metric"))
| aggregate total:sum(rate)

Wrong (no pipe with time duration):

align 5m, rate:sum(m("metric"))
aggregate total:sum(rate)

Correct:

# Summary - NO pipe
align options(bins: 1), rate:sum(m("metric"))
aggregate total:sum(rate)

# Time-series - YES pipe
align 5m, rate:sum(m("metric"))
| aggregate total:sum(rate)

Why: Syntax differs between summary and time-series patterns.

Pitfall 3: Grouping by Non-Existent Dimension

Wrong (when the metric has no service_name dimension):

align options(bins: 1), rate:sum(m("metric"))
aggregate total:sum(rate), group_by(service_name)

Error: "field 'service_name' does not exist"

Correct:

# First: discover_context(metric_name="metric") to see available dimensions
# Then: use only dimensions that exist
align options(bins: 1), rate:sum(m("metric"))
aggregate total:sum(rate), group_by(correct_dimension_name)

Why: Not all metrics have the same dimensions - always check first.
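One way to catch this before running a query is to compare the intended group_by fields with the dimensions reported by discovery. This is a minimal Python sketch; missing_dimensions is a hypothetical helper, and the dimension list is copied by hand from the Complete Example rather than fetched from discover_context().

```python
from typing import Iterable, List


def missing_dimensions(requested: Iterable[str], available: Iterable[str]) -> List[str]:
    """Return the requested group_by fields that the metric does not expose."""
    available_set = set(available)
    return [dim for dim in requested if dim not in available_set]


# Dimensions as listed for span_call_count_5m in the Complete Example.
available = ["service_name", "service_namespace", "environment", "span_name"]

print(missing_dimensions(["service_name"], available))                  # []
print(missing_dimensions(["service_name", "k8s_pod_name"], available))  # ['k8s_pod_name']
```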

Pitfall 4: Using statsby Instead of aggregate

Wrong:

align options(bins: 1), rate:sum(m("metric"))
statsby total:sum(rate)

Correct:

align options(bins: 1), rate:sum(m("metric"))
aggregate total:sum(rate)

Why: After align, use aggregate, not statsby, which is for Event/Interval datasets.

Aggregation Functions Reference

Common functions used with gauge metrics:

# Summing values
align options(bins: 1), metric:sum(m("metric_name"))
aggregate total:sum(metric)

# Averaging values
align options(bins: 1), metric:avg(m("metric_name"))
aggregate average:avg(metric)

# Maximum value
align options(bins: 1), metric:max(m("metric_name"))
aggregate maximum:max(metric)

# Minimum value
align options(bins: 1), metric:min(m("metric_name"))
aggregate minimum:min(metric)

# Count of samples
align options(bins: 1), metric:count(m("metric_name"))
aggregate sample_count:count(metric)

Pattern: The same function is used in both align and aggregate.

Time Bucket Options

Common time durations for time-series queries:

align 1m, ...    # 1-minute buckets
align 5m, ...    # 5-minute buckets (common)
align 15m, ...   # 15-minute buckets
align 1h, ...    # 1-hour buckets
align 1d, ...    # 1-day buckets

Default: align without duration uses automatic binning (300 bins).
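To pick a duration, it helps to know how many rows per group each one produces. This Python sketch computes the bucket count for a 24-hour window, assuming buckets tile the range from its start. duration_seconds and bucket_count are hypothetical helpers that handle only the m, h, and d suffixes listed above.

```python
import math

# Seconds per unit for the duration suffixes shown above.
UNIT_SECONDS = {"m": 60, "h": 3600, "d": 86400}


def duration_seconds(duration: str) -> int:
    """Parse durations such as '5m', '1h', '1d' into seconds."""
    return int(duration[:-1]) * UNIT_SECONDS[duration[-1]]


def bucket_count(range_seconds: int, duration: str) -> int:
    """Number of buckets needed to cover the query range."""
    return math.ceil(range_seconds / duration_seconds(duration))


day = 24 * 3600
for d in ["1m", "5m", "15m", "1h", "1d"]:
    print(d, bucket_count(day, d))
# 1m 1440
# 5m 288
# 15m 96
# 1h 24
# 1d 1
```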

Best Practices

  1. Always use discover_context() first to find metrics and check dimensions
  2. Verify metric type - this skill is for gauge/counter/delta (NOT tdigest)
  3. Use summary pattern (bins: 1) for single statistics, reports, totals
  4. Use time-series pattern (5m, 1h) for dashboards, trending, charts
  5. Remember pipe rule: bins:1 = no pipe, time duration = yes pipe
  6. Use fill to replace nulls with zeros for complete results
  7. Add sort + limit for top-N queries to avoid overwhelming output
  8. Check available dimensions before using group_by

Related Skills

  • analyzing-tdigest-metrics - For percentile metrics (latency, duration p95/p99)
  • time-series-analysis - For event/interval trending with timechart (different from metrics)
  • aggregating-event-datasets - For aggregating raw events with statsby (different from metrics)
  • working-with-intervals - For calculating durations from raw interval data

Summary

Gauge metrics are pre-aggregated measurements that require the align verb:

  • Core pattern: align + m() + aggregate
  • Metric types: gauge, counter, delta (NOT tdigest)
  • Two output modes:
    • Summary: options(bins: 1) → one row per group, NO pipe
    • Time-series: 5m, 1h → many rows per group, YES pipe
  • Common functions: sum, avg, max, min, count
  • Discovery: Use discover_context() to find metrics and dimensions

Key distinction: Metrics are pre-aggregated (use align), while Events/Intervals are raw data (use statsby/timechart).


Last Updated: November 14, 2025
Version: 1.0
Tested With: Observe OPAL (ServiceExplorer/Service Metrics)