SKILL.md

name: analyzing-apm-data
description: Monitor application performance using the RED methodology (Rate, Errors, Duration) with Observe. Use when analyzing service health, investigating errors, tracking latency, or building APM dashboards. Covers when to use metrics vs spans, combining RED signals, and troubleshooting workflows. Cross-references working-with-intervals, aggregating-gauge-metrics, and analyzing-tdigest-metrics skills.

Analyzing APM Data

Application Performance Monitoring (APM) tracks the health and performance of your services using telemetry data. This skill teaches the RED methodology and how to choose between pre-aggregated metrics and raw span data for different APM use cases.

When to Use This Skill

  • Monitoring microservices health and performance
  • Building service dashboards with Rate, Errors, Duration (RED)
  • Investigating production incidents and errors
  • Tracking latency and throughput across services
  • Setting up SLO (Service Level Objective) monitoring
  • Understanding when to use metrics vs spans for APM
  • Root cause analysis of service degradation
  • Analyzing service dependencies and call patterns
  • Identifying database bottlenecks and slow queries
  • Understanding which services depend on a failing component

Prerequisites

  • Access to Observe tenant with OpenTelemetry span data
  • Understanding of distributed tracing concepts
  • Familiarity with metrics queries (see aggregating-gauge-metrics, analyzing-tdigest-metrics)
  • Familiarity with interval queries (see working-with-intervals)

Key Concepts

The RED Methodology

RED is a standard approach for monitoring microservices health:

R - Rate: Request volume (requests per second/minute/hour)

  • How much traffic is each service handling?
  • Are there traffic spikes or drops?

E - Errors: Error count and error rate (percentage)

  • How many requests are failing?
  • Which services have the highest error rates?

D - Duration: Latency percentiles (p50, p95, p99)

  • How fast are services responding?
  • Are there latency spikes?

Metrics vs Spans Decision Tree

Observe provides two ways to analyze APM data:

Metrics (Pre-aggregated, Fast)

  • Request counts: span_call_count_5m
  • Error counts: span_error_count_5m
  • Latency percentiles: span_*_duration_tdigest_5m
  • Pre-aggregated at 5-minute intervals

Spans (Raw Data, Detailed)

  • Dataset: OpenTelemetry/Span (interface: otel_span)
  • Complete trace information
  • All span attributes available

Decision Matrix:

  • Dashboard, 24h+ range → Metrics (fast, efficient, pre-aggregated)
  • "How many errors?" → Metrics (quick counts, good for alerts)
  • "What went wrong?" → Spans (error messages, stack traces)
  • Latency overview → Metrics (tdigest; efficient percentiles)
  • Filter by endpoint → Spans (full attribute filtering)
  • Real-time monitoring → Metrics (consistent performance)
  • Root cause analysis → Spans (complete context, trace IDs)
  • Service dependencies → Spans (trace relationships, span types)
  • Database bottlenecks → Spans (DB attributes, query details)

Recommended Workflow:

  1. Start with metrics (fast overview)
  2. Identify anomalies (high errors, latency spikes)
  3. Drill down with spans (detailed investigation)

Discovery Workflow

Step 1: Find APM metrics

discover_context("request error latency", result_type="metric")

Common metrics:

  • span_call_count_5m - Request counts (gauge)
  • span_error_count_5m - Error counts (gauge)
  • span_sn_service_node_duration_tdigest_5m - Latency (tdigest)

Step 2: Find span dataset

discover_context("opentelemetry span trace")

Look for: OpenTelemetry/Span (interface: otel_span)

Step 3: Get detailed schemas

discover_context(metric_name="span_call_count_5m")
discover_context(dataset_id="<span_dataset_id>")

RED Methodology Patterns

Pattern 1: Rate - Request Volume (Metrics)

Get total requests per service over 24 hours:

align options(bins: 1), rate:sum(m("span_call_count_5m"))
aggregate total_requests:sum(rate), group_by(service_name)
| sort desc(total_requests)
| limit 10

Output: One row per service with total request count.

Use case: Identify highest-traffic services, detect traffic drops.

Pattern 2: Errors - Error Count (Metrics)

Get total errors per service:

align options(bins: 1), errors:sum(m("span_error_count_5m"))
aggregate total_errors:sum(errors), group_by(service_name)
fill total_errors:0
| sort desc(total_errors)
| limit 10

Output: Services ranked by error count (including zeros with fill).

Use case: Quick error overview, dashboard widgets.

Pattern 3: Duration - Latency Percentiles (Metrics)

Get latency percentiles per service:

align options(bins: 1), combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
aggregate p50:tdigest_quantile(tdigest_combine(combined), 0.50),
          p95:tdigest_quantile(tdigest_combine(combined), 0.95),
          p99:tdigest_quantile(tdigest_combine(combined), 0.99),
          group_by(service_name)
| make_col p50_ms:p50/1000000, p95_ms:p95/1000000, p99_ms:p99/1000000
| sort desc(p95_ms)
| limit 10

Output: Latency percentiles in milliseconds per service.

Use case: SLO tracking, latency monitoring, performance comparison.

Note: Uses tdigest double-combine pattern (see analyzing-tdigest-metrics skill).

Pattern 4: Combined RED Dashboard (Metrics)

Calculate Rate, Errors, and Error Rate together:

align options(bins: 1),
      requests:sum(m("span_call_count_5m")),
      errors:sum(m("span_error_count_5m"))
aggregate total_requests:sum(requests),
          total_errors:sum(errors),
          group_by(service_name)
fill total_errors:0
| make_col error_rate:float64(total_errors) / float64(total_requests) * 100.0
| sort desc(error_rate)
| limit 10

Output: Services with request count, error count, and error rate percentage.

Use case: Complete service health dashboard.

Pattern 5: Trending - Requests Over Time (Metrics)

Track request volume over time for charting:

align 1h, rate:sum(m("span_call_count_5m"))
| aggregate requests_per_hour:sum(rate), group_by(service_name)
| filter service_name = "frontend"

Output: Hourly time-series data for one service.

Use case: Dashboard charts, trend analysis, capacity planning.

Pipe rule: Time-series (1h) requires pipe | after align.

Detailed Investigation Patterns (Spans)

Pattern 6: Rate from Spans (Detailed)

Count requests from raw spans (1-hour window recommended):

make_col svc:service_name
| statsby request_count:count(), group_by(svc)
| sort desc(request_count)
| limit 10

Use case: When you need exact counts for short time ranges or want to filter by span attributes first.

Note: Use shorter time ranges (1h) for performance.

Pattern 7: Error Analysis with Messages (Spans)

Get error details including messages:

filter error = true
| make_col svc:service_name,
          error_msg:string(error_message),
          span:span_name,
          status:string(status_code)
| statsby error_count:count(), group_by(svc, error_msg, span)
| sort desc(error_count)
| limit 10

Output: Error counts WITH full error messages and span names.

Use case: Root cause analysis - understand WHY errors happened.

Key advantage: Spans show WHAT went wrong, metrics only show HOW MANY.

Pattern 8: Latency from Spans (Filtered)

Calculate latency percentiles for specific span types:

filter span_type = "Service entry point"
| make_col svc:service_name, dur_ms:duration / 1ms
| statsby p50:percentile(dur_ms, 0.50),
          p95:percentile(dur_ms, 0.95),
          p99:percentile(dur_ms, 0.99),
          group_by(svc)
| sort desc(p95)
| limit 10

Output: Latency percentiles for user-facing requests only.

Use case: Focus on end-user experience, exclude internal calls.

Duration conversion: duration / 1ms converts nanoseconds to milliseconds.

Pattern 9: Error Rate from Spans

Calculate error rate using raw span data:

make_col svc:service_name, is_error:if(error = true, 1, 0)
| statsby total:count(), error_count:sum(is_error), group_by(svc)
| make_col error_rate:float64(error_count) / float64(total) * 100.0
| sort desc(error_rate)
| limit 10

Output: Services with total requests, errors, and error rate.

Use case: When you need to filter spans first (e.g., only specific endpoints).

Complete APM Workflow Example

Scenario: You notice performance issues and need to investigate.

Step 1: Check overall service health (Metrics - Fast)

align options(bins: 1),
      requests:sum(m("span_call_count_5m")),
      errors:sum(m("span_error_count_5m"))
aggregate total_requests:sum(requests),
          total_errors:sum(errors),
          group_by(service_name)
fill total_errors:0
| make_col error_rate:float64(total_errors) / float64(total_requests) * 100.0
| sort desc(error_rate)

Finding: cartservice has 0.48% error rate - investigate further.

Step 2: Get error details (Spans - Detailed)

filter error = true
| filter service_name = "cartservice"
| make_col error_msg:string(error_message),
          span:span_name,
          endpoint:string(attributes."http.target")
| statsby error_count:count(), group_by(error_msg, span, endpoint)
| sort desc(error_count)

Finding: "Can't access cart storage. System.ApplicationException: Wasn't able to connect to redis" - Redis connection issue!

Step 3: Check latency impact (Metrics)

align options(bins: 1), combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95),
          group_by(service_name)
| filter service_name = "cartservice"
| make_col p95_ms:p95/1000000

Finding: p95 latency is elevated - errors are impacting response time.

Step 4: View hourly trend (Metrics)

align 1h, errors:sum(m("span_error_count_5m"))
| aggregate errors_per_hour:sum(errors), group_by(service_name)
| filter service_name = "cartservice"

Finding: Errors started 3 hours ago - correlate with Redis deployment?

Result: Complete picture of issue (what, when, impact) using metrics + spans.

Dependency Tracking and Service Relationships

Understanding service dependencies is critical for root cause analysis. Spans contain relationship information through trace IDs and span types.

Key Span Fields for Dependency Tracking

Trace relationships:

  • trace_id - Links all spans in a single request
  • span_id - Unique identifier for this span
  • parent_span_id - Links to calling span

Span direction:

  • kind - "CLIENT" (outgoing call) or "SERVER" (incoming request)
  • span_type - "Service entry point", "Remote call", "Database call"

Target identification:

  • span_name - Operation name (e.g., "grpc.oteldemo.CartService/GetCart")
  • attributes.db.* - Database connection details
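
To see these relationship fields together before writing dependency queries, project them from a handful of raw spans. A minimal sketch against the OpenTelemetry/Span dataset, assuming a hypothetical service name "frontend" (swap in your own) and a short time range:

filter service_name = "frontend"
| make_col trace:trace_id,
          span:span_id,
          parent:parent_span_id,
          direction:kind,
          type:span_type,
          op:span_name
| limit 20

Each row shows one hop: parent_span_id points at the caller's span_id within the same trace_id, and kind indicates whether the row is the outgoing (CLIENT) or incoming (SERVER) side of the call.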

Pattern 10: What Services Does X Depend On? (Downstream)

Find all services that a given service calls:

filter service_name = "frontend" and kind = "CLIENT"
| make_col target:span_name, dur_ms:duration / 1ms
| statsby calls:count(),
          p95_ms:percentile(dur_ms, 0.95),
          group_by(target)
| sort desc(calls)

Output: All downstream dependencies with call volume and latency.

Use case: "Frontend is slow - which downstream services are contributing?"

Pattern 11: What Services Depend On X? (Upstream)

Find all services that call a given service (requires trace traversal):

filter span_type = "Service entry point"
| make_col svc:service_name
| statsby total:count(), group_by(svc)
| sort desc(total)

Output: Services ordered by request volume (entry points).

Note: Full upstream analysis requires joining spans by trace_id (advanced).
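
When you do not want to build a trace join, a manual two-step traversal is often enough. A hedged sketch, assuming a hypothetical target service "cartservice": first collect a few trace IDs that passed through it:

filter service_name = "cartservice" and span_type = "Service entry point"
| make_col trace:trace_id
| limit 5

Then pull every span in one of those traces to see the callers:

filter trace_id = "<trace_id from the previous query>"
| make_col svc:service_name, span:span_name, parent:parent_span_id, dur_ms:duration / 1ms
| sort desc(dur_ms)

The span whose span_id matches the parent_span_id of the cartservice entry-point span is its direct upstream caller; repeating this across a few traces gives a reasonable picture of who depends on the service.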

Pattern 12: Database Dependencies

Identify which services are making database calls:

filter string(attributes."db.type") = "sql"
| make_col caller:service_name,
          db_name:string(attributes."db.name"),
          dur_ms:duration / 1ms
| statsby call_count:count(),
          p95_latency:percentile(dur_ms, 0.95),
          group_by(caller, db_name)
| sort desc(call_count)

Output: Service-to-database call patterns with latency.

Use case: "Are there any slow database queries impacting my services?"

Pattern 13: Service Call Patterns (All Outbound)

Get a complete picture of all outbound service calls:

filter kind = "CLIENT"
| make_col caller:service_name, target:span_name
| statsby call_count:count(), group_by(caller, target)
| sort desc(call_count)
| limit 20

Output: Complete service-to-service call matrix.

Use case: Understanding service dependencies architecture-wide.

Dependency Troubleshooting Workflow

Scenario: Service X has elevated latency - is it a downstream dependency?

Step 1: Identify service's dependencies

filter service_name = "X" and kind = "CLIENT"
| make_col target:span_name, dur_ms:duration / 1ms
| statsby calls:count(),
          p95_ms:percentile(dur_ms, 0.95),
          group_by(target)
| sort desc(p95_ms)

Step 2: Check if database calls are slow

filter service_name = "X"
| filter string(attributes."db.type") = "sql"
| make_col db:string(attributes."db.name"),
          query:string(attributes."db.statement"),
          dur_ms:duration / 1ms
| statsby avg_ms:avg(dur_ms),
          p95_ms:percentile(dur_ms, 0.95),
          count:count(),
          group_by(db, query)
| sort desc(p95_ms)
| limit 10

Step 3: Compare to baseline. Run the same queries over an earlier time range to see whether latency changed.
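
One way to eyeball a baseline without re-running everything is an hourly latency trend for the affected service. This sketch combines the time-series form from Pattern 5 with the tdigest double-combine pattern from Pattern 3; it assumes that combination is supported in your tenant (if not, run the Pattern 3 query twice with different time ranges and compare):

align 1h, combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
| aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95), group_by(service_name)
| filter service_name = "X"
| make_col p95_ms:p95/1000000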

Result: Identify which downstream dependency is causing the issue.

Common Use Cases

Use Case 1: Service Health Dashboard

Goal: Real-time overview of all services.

Solution: Use metrics with summary pattern:

align options(bins: 1),
      requests:sum(m("span_call_count_5m")),
      errors:sum(m("span_error_count_5m"))
aggregate total_requests:sum(requests),
          total_errors:sum(errors),
          group_by(service_name)
fill total_errors:0
| make_col error_rate:float64(total_errors) / float64(total_requests) * 100.0

Add the latency query separately (tdigest metrics; see Use Case 2 below).

Why metrics: Fast, efficient, updates every 5 minutes.

Use Case 2: SLO Tracking

Goal: Track p95 latency against 100ms SLO.

Solution: Use tdigest metrics:

align options(bins: 1), combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95),
          group_by(service_name)
| make_col p95_ms:p95/1000000, slo_breach:if(p95/1000000 > 100, true, false)
| filter slo_breach = true

Output: Services exceeding 100ms p95 SLO.

Use Case 3: Error Investigation

Goal: Understand why a specific service is failing.

Solution: Start with metrics (count), drill down with spans (details):

filter error = true
| filter service_name = "target-service"
| make_col error_msg:string(error_message),
          trace:trace_id,
          timestamp:start_time
| sort desc(timestamp)
| limit 50

Output: Recent errors with messages and trace IDs for further investigation.

Use Case 4: Capacity Planning

Goal: Understand traffic patterns over 30 days.

Solution: Use metrics with daily buckets:

align 1d, rate:sum(m("span_call_count_5m"))
| aggregate requests_per_day:sum(rate), group_by(service_name)

Time range: Set to 30d in query parameters.

Output: Daily request volume for capacity analysis.

Best Practices

  1. Start with metrics for dashboards, alerts, and overview queries
  2. Use spans for investigation when you need details or filtering
  3. Keep time ranges short when querying spans (1h recommended, max 24h)
  4. Use time-series patterns (align 1h) for trending charts
  5. Use summary patterns (options(bins: 1)) for single statistics
  6. Combine metrics (requests + errors) in single query for efficiency
  7. Filter spans early to improve performance (filter before make_col; see the sketch after this list)
  8. Use fill to show services with zero errors in dashboards
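
A minimal sketch of practice 7 (filter early), assuming a hypothetical service name; pushing the filter ahead of make_col and statsby keeps the projection and aggregation working on far fewer rows:

filter service_name = "checkout" and error = true
| make_col error_msg:string(error_message)
| statsby error_count:count(), group_by(error_msg)
| sort desc(error_count)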

Performance Considerations

Metrics queries:

  • Consistent performance across time ranges (1h same speed as 30d)
  • Ideal for: Dashboards, real-time monitoring, long time ranges
  • Limitation: Fixed dimensions, no custom filtering

Span queries:

  • Performance degrades with longer time ranges
  • Ideal for: Short time windows (1h), detailed investigation, filtered analysis
  • Limitation: Slower for high-volume services over long periods

Recommended approach:

  • Dashboards: 100% metrics
  • Alerts: Metrics
  • Investigation: Metrics (overview) → Spans (details)
  • Trace analysis: Spans only

Common Pitfalls

Pitfall 1: Using Spans for Long Time Ranges

Wrong:

make_col svc:service_name
| statsby request_count:count(), group_by(svc)

With a 30-day time range - very slow!

Correct:

align options(bins: 1), rate:sum(m("span_call_count_5m"))
aggregate total_requests:sum(rate), group_by(service_name)

Metrics handle 30 days efficiently.

Pitfall 2: Forgetting Time Units

Wrong:

make_col dur_ms:float64(duration)/1000000

Correct:

make_col dur_ms:duration / 1ms

Use OPAL duration units (see working-with-intervals skill).

Pitfall 3: Not Using fill for Zeros

Wrong:

aggregate total_errors:sum(errors), group_by(service_name)

Services with zero errors won't appear.

Correct:

aggregate total_errors:sum(errors), group_by(service_name)
fill total_errors:0

All services shown (important for dashboards).

Pitfall 4: Wrong Metric for Latency

Wrong:

align options(bins: 1), dur:avg(m("span_duration_5m"))

Don't use avg() on pre-aggregated duration.

Correct:

align options(bins: 1), combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95)

Use tdigest metrics for percentiles.

Related Skills

  • working-with-intervals - Understanding span duration and temporal queries
  • aggregating-gauge-metrics - Request and error count metrics (gauge type)
  • analyzing-tdigest-metrics - Latency percentile metrics (tdigest type)
  • filtering-event-datasets - Filtering techniques applicable to spans
  • time-series-analysis - Time-bucketed analysis with timechart (alternative to align)

Summary

APM in Observe uses the RED methodology:

  • Rate: Request counts from span_call_count_5m metric
  • Errors: Error counts from span_error_count_5m metric
  • Duration: Latency percentiles from span_*_duration_tdigest_5m metric

Key decision: Metrics for speed and overview, Spans for detail and investigation.

Workflow: Metrics (identify issues) → Spans (root cause analysis) → Traces (follow request flow).

Performance: Metrics scale to long time ranges, spans are best for short windows.


Last Updated: November 14, 2025 | Version: 1.0 | Tested With: Observe OPAL (OpenTelemetry/Span + ServiceExplorer Metrics)