| name | analyzing-apm-data |
| description | Monitor application performance using the RED methodology (Rate, Errors, Duration) with Observe. Use when analyzing service health, investigating errors, tracking latency, or building APM dashboards. Covers when to use metrics vs spans, combining RED signals, and troubleshooting workflows. Cross-references working-with-intervals, aggregating-gauge-metrics, and analyzing-tdigest-metrics skills. |
Analyzing APM Data
Application Performance Monitoring (APM) tracks the health and performance of your services using telemetry data. This skill teaches the RED methodology and how to choose between pre-aggregated metrics and raw span data for different APM use cases.
When to Use This Skill
- Monitoring microservices health and performance
- Building service dashboards with Rate, Errors, Duration (RED)
- Investigating production incidents and errors
- Tracking latency and throughput across services
- Setting up SLO (Service Level Objective) monitoring
- Understanding when to use metrics vs spans for APM
- Root cause analysis of service degradation
- Analyzing service dependencies and call patterns
- Identifying database bottlenecks and slow queries
- Understanding which services depend on a failing component
Prerequisites
- Access to Observe tenant with OpenTelemetry span data
- Understanding of distributed tracing concepts
- Familiarity with metrics queries (see aggregating-gauge-metrics, analyzing-tdigest-metrics)
- Familiarity with interval queries (see working-with-intervals)
Key Concepts
The RED Methodology
RED is a standard approach for monitoring microservices health:
R - Rate: Request volume (requests per second/minute/hour)
- How much traffic is each service handling?
- Are there traffic spikes or drops?
E - Errors: Error count and error rate (percentage)
- How many requests are failing?
- Which services have the highest error rates?
D - Duration: Latency percentiles (p50, p95, p99)
- How fast are services responding?
- Are there latency spikes?
Metrics vs Spans Decision Tree
Observe provides two ways to analyze APM data:
Metrics (Pre-aggregated, Fast)
- Request counts: span_call_count_5m
- Error counts: span_error_count_5m
- Latency percentiles: span_*_duration_tdigest_5m
- Pre-aggregated at 5-minute intervals
Spans (Raw Data, Detailed)
- Dataset: OpenTelemetry/Span (interface: otel_span)
- Complete trace information
- All span attributes available
Decision Matrix:
| Need | Use | Why |
|---|---|---|
| Dashboard, 24h+ range | Metrics | Fast, efficient, pre-aggregated |
| "How many errors?" | Metrics | Quick counts, good for alerts |
| "What went wrong?" | Spans | Error messages, stack traces |
| Latency overview | Metrics (tdigest) | Efficient percentiles |
| Filter by endpoint | Spans | Full attribute filtering |
| Real-time monitoring | Metrics | Consistent performance |
| Root cause analysis | Spans | Complete context, trace IDs |
| Service dependencies | Spans | Trace relationships, span types |
| Database bottlenecks | Spans | DB attributes, query details |
Recommended Workflow:
- Start with metrics (fast overview)
- Identify anomalies (high errors, latency spikes)
- Drill down with spans (detailed investigation)
Discovery Workflow
Step 1: Find APM metrics
discover_context("request error latency", result_type="metric")
Common metrics:
- span_call_count_5m - Request counts (gauge)
- span_error_count_5m - Error counts (gauge)
- span_sn_service_node_duration_tdigest_5m - Latency (tdigest)
Step 2: Find span dataset
discover_context("opentelemetry span trace")
Look for: OpenTelemetry/Span (interface: otel_span)
Step 3: Get detailed schemas
discover_context(metric_name="span_call_count_5m")
discover_context(dataset_id="<span_dataset_id>")
RED Methodology Patterns
Pattern 1: Rate - Request Volume (Metrics)
Get total requests per service over 24 hours:
align options(bins: 1), rate:sum(m("span_call_count_5m"))
aggregate total_requests:sum(rate), group_by(service_name)
| sort desc(total_requests)
| limit 10
Output: One row per service with total request count.
Use case: Identify highest-traffic services, detect traffic drops.
Pattern 2: Errors - Error Count (Metrics)
Get total errors per service:
align options(bins: 1), errors:sum(m("span_error_count_5m"))
aggregate total_errors:sum(errors), group_by(service_name)
fill total_errors:0
| sort desc(total_errors)
| limit 10
Output: Services ranked by error count (including zeros with fill).
Use case: Quick error overview, dashboard widgets.
Pattern 3: Duration - Latency Percentiles (Metrics)
Get latency percentiles per service:
align options(bins: 1), combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
aggregate p50:tdigest_quantile(tdigest_combine(combined), 0.50),
p95:tdigest_quantile(tdigest_combine(combined), 0.95),
p99:tdigest_quantile(tdigest_combine(combined), 0.99),
group_by(service_name)
| make_col p50_ms:p50/1000000, p95_ms:p95/1000000, p99_ms:p99/1000000
| sort desc(p95_ms)
| limit 10
Output: Latency percentiles in milliseconds per service.
Use case: SLO tracking, latency monitoring, performance comparison.
Note: Uses tdigest double-combine pattern (see analyzing-tdigest-metrics skill).
Pattern 4: Combined RED Dashboard (Metrics)
Calculate Rate, Errors, and Error Rate together:
align options(bins: 1),
requests:sum(m("span_call_count_5m")),
errors:sum(m("span_error_count_5m"))
aggregate total_requests:sum(requests),
total_errors:sum(errors),
group_by(service_name)
fill total_errors:0
| make_col error_rate:float64(total_errors) / float64(total_requests) * 100.0
| sort desc(error_rate)
| limit 10
Output: Services with request count, error count, and error rate percentage.
Use case: Complete service health dashboard.
Pattern 5: Trending - Requests Over Time (Metrics)
Track request volume over time for charting:
align 1h, rate:sum(m("span_call_count_5m"))
| aggregate requests_per_hour:sum(rate), group_by(service_name)
| filter service_name = "frontend"
Output: Hourly time-series data for one service.
Use case: Dashboard charts, trend analysis, capacity planning.
Pipe rule: the time-series form (align 1h) requires a pipe | before the next verb; the summary form (options(bins: 1)) does not.
Detailed Investigation Patterns (Spans)
Pattern 6: Rate from Spans (Detailed)
Count requests from raw spans (1-hour window recommended):
make_col svc:service_name
| statsby request_count:count(), group_by(svc)
| sort desc(request_count)
| limit 10
Use case: When you need exact counts for short time ranges or want to filter by span attributes first.
Note: Use shorter time ranges (1h) for performance.
Pattern 7: Error Analysis with Messages (Spans)
Get error details including messages:
filter error = true
| make_col svc:service_name,
error_msg:string(error_message),
span:span_name,
status:string(status_code)
| statsby error_count:count(), group_by(svc, error_msg, span)
| sort desc(error_count)
| limit 10
Output: Error counts WITH full error messages and span names.
Use case: Root cause analysis - understand WHY errors happened.
Key advantage: Spans show WHAT went wrong; metrics only show HOW MANY.
Pattern 8: Latency from Spans (Filtered)
Calculate latency percentiles for specific span types:
filter span_type = "Service entry point"
| make_col svc:service_name, dur_ms:duration / 1ms
| statsby p50:percentile(dur_ms, 0.50),
p95:percentile(dur_ms, 0.95),
p99:percentile(dur_ms, 0.99),
group_by(svc)
| sort desc(p95)
| limit 10
Output: Latency percentiles for user-facing requests only.
Use case: Focus on end-user experience, exclude internal calls.
Duration conversion: duration / 1ms converts nanoseconds to milliseconds.
Pattern 9: Error Rate from Spans
Calculate error rate using raw span data:
make_col svc:service_name, is_error:if(error = true, 1, 0)
| statsby total:count(), error_count:sum(is_error), group_by(svc)
| make_col error_rate:float64(error_count) / float64(total) * 100.0
| sort desc(error_rate)
| limit 10
Output: Services with total requests, errors, and error rate.
Use case: When you need to filter spans first (e.g., only specific endpoints).
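For instance, you might scope the error-rate calculation to a single endpoint before aggregating. In this sketch the /api/cart path is a hypothetical value; substitute whatever http.target values your spans actually carry:
filter string(attributes."http.target") = "/api/cart"
| make_col svc:service_name, is_error:if(error = true, 1, 0)
| statsby total:count(), error_count:sum(is_error), group_by(svc)
| make_col error_rate:float64(error_count) / float64(total) * 100.0
| sort desc(error_rate)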
Complete APM Workflow Example
Scenario: You notice performance issues and need to investigate.
Step 1: Check overall service health (Metrics - Fast)
align options(bins: 1),
requests:sum(m("span_call_count_5m")),
errors:sum(m("span_error_count_5m"))
aggregate total_requests:sum(requests),
total_errors:sum(errors),
group_by(service_name)
fill total_errors:0
| make_col error_rate:float64(total_errors) / float64(total_requests) * 100.0
| sort desc(error_rate)
Finding: cartservice has 0.48% error rate - investigate further.
Step 2: Get error details (Spans - Detailed)
filter error = true
| filter service_name = "cartservice"
| make_col error_msg:string(error_message),
span:span_name,
endpoint:string(attributes."http.target")
| statsby error_count:count(), group_by(error_msg, span, endpoint)
| sort desc(error_count)
Finding: "Can't access cart storage. System.ApplicationException: Wasn't able to connect to redis" - Redis connection issue!
Step 3: Check latency impact (Metrics)
align options(bins: 1), combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95),
group_by(service_name)
| filter service_name = "cartservice"
| make_col p95_ms:p95/1000000
Finding: p95 latency is elevated - errors are impacting response time.
Step 4: View hourly trend (Metrics)
align 1h, errors:sum(m("span_error_count_5m"))
| aggregate errors_per_hour:sum(errors), group_by(service_name)
| filter service_name = "cartservice"
Finding: Errors started 3 hours ago - correlate with Redis deployment?
Result: Complete picture of issue (what, when, impact) using metrics + spans.
Dependency Tracking and Service Relationships
Understanding service dependencies is critical for root cause analysis. Spans contain relationship information through trace IDs and span types.
Key Span Fields for Dependency Tracking
Trace relationships:
- trace_id - Links all spans in a single request
- span_id - Unique identifier for this span
- parent_span_id - Links to the calling span
Span direction:
kind- "CLIENT" (outgoing call) or "SERVER" (incoming request)span_type- "Service entry point", "Remote call", "Database call"
Target identification:
- span_name - Operation name (e.g., "grpc.oteldemo.CartService/GetCart")
- attributes.db.* - Database connection details
Pattern 10: What Services Does X Depend On? (Downstream)
Find all services that a given service calls:
filter service_name = "frontend" and kind = "CLIENT"
| make_col target:span_name, dur_ms:duration / 1ms
| statsby calls:count(),
p95_ms:percentile(dur_ms, 0.95),
group_by(target)
| sort desc(calls)
Output: All downstream dependencies with call volume and latency.
Use case: "Frontend is slow - which downstream services are contributing?"
Pattern 11: What Services Depend On X? (Upstream)
Find all services that call a given service (requires trace traversal):
filter span_type = "Service entry point"
| make_col svc:service_name
| statsby total:count(), group_by(svc)
| sort desc(total)
Output: Services ordered by request volume (entry points).
Note: Full upstream analysis requires joining spans by trace_id (advanced).
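A rough sketch of that trace traversal, assuming OPAL's named subquery stages and join verb are available in your environment - the syntax here is illustrative and may need adjustment, and "cartservice" is just an example target:
@callers <- @ {
  filter kind = "CLIENT"
  make_col caller:service_name, caller_span_id:span_id
}
filter service_name = "cartservice" and kind = "SERVER"
| join on(parent_span_id=@callers.caller_span_id), caller:@callers.caller
| statsby calls:count(), group_by(caller)
| sort desc(calls)
The idea is to attribute each incoming (SERVER) span to the upstream service whose CLIENT span is its parent.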
Pattern 12: Database Dependencies
Identify which services are making database calls:
filter string(attributes."db.type") = "sql"
| make_col caller:service_name,
db_name:string(attributes."db.name"),
dur_ms:duration / 1ms
| statsby call_count:count(),
p95_latency:percentile(dur_ms, 0.95),
group_by(caller, db_name)
| sort desc(call_count)
Output: Service-to-database call patterns with latency.
Use case: "Are there any slow database queries impacting my services?"
Pattern 13: Service Call Patterns (All Outbound)
Get a complete picture of all outbound service calls:
filter kind = "CLIENT"
| make_col caller:service_name, target:span_name
| statsby call_count:count(), group_by(caller, target)
| sort desc(call_count)
| limit 20
Output: Complete service-to-service call matrix.
Use case: Understanding service dependencies architecture-wide.
Dependency Troubleshooting Workflow
Scenario: Service X has elevated latency - is it a downstream dependency?
Step 1: Identify service's dependencies
filter service_name = "X" and kind = "CLIENT"
| make_col target:span_name, dur_ms:duration / 1ms
| statsby calls:count(),
p95_ms:percentile(dur_ms, 0.95),
group_by(target)
| sort desc(p95_ms)
Step 2: Check if database calls are slow
filter service_name = "X"
| filter string(attributes."db.type") = "sql"
| make_col db:string(attributes."db.name"),
query:string(attributes."db.statement"),
dur_ms:duration / 1ms
| statsby avg_ms:avg(dur_ms),
p95_ms:percentile(dur_ms, 0.95),
count:count(),
group_by(db, query)
| sort desc(p95_ms)
| limit 10
Step 3: Compare to baseline. Run the same queries over an earlier time range, or chart the trend hourly, to see whether latency has changed.
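One way to see the baseline at a glance is to chart p95 hourly, combining the time-series pattern with the tdigest pattern shown earlier (a sketch; "X" is a placeholder service name):
align 1h, combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
| aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95), group_by(service_name)
| filter service_name = "X"
| make_col p95_ms:p95/1000000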
Result: Identify which downstream dependency is causing the issue.
Common Use Cases
Use Case 1: Service Health Dashboard
Goal: Real-time overview of all services.
Solution: Use metrics with summary pattern:
align options(bins: 1),
requests:sum(m("span_call_count_5m")),
errors:sum(m("span_error_count_5m"))
aggregate total_requests:sum(requests),
total_errors:sum(errors),
group_by(service_name)
fill total_errors:0
| make_col error_rate:float64(total_errors) / float64(total_requests) * 100.0
Add latency query separately (tdigest metrics).
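For example, a minimal sketch reusing Pattern 3's tdigest approach for the latency panel:
align options(bins: 1), combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95), group_by(service_name)
| make_col p95_ms:p95/1000000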
Why metrics: Fast, efficient, updates every 5 minutes.
Use Case 2: SLO Tracking
Goal: Track p95 latency against 100ms SLO.
Solution: Use tdigest metrics:
align options(bins: 1), combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95),
group_by(service_name)
| make_col p95_ms:p95/1000000, slo_breach:if(p95/1000000 > 100, true, false)
| filter slo_breach = true
Output: Services exceeding 100ms p95 SLO.
Use Case 3: Error Investigation
Goal: Understand why a specific service is failing.
Solution: Start with metrics (count), drill down with spans (details):
filter error = true
| filter service_name = "target-service"
| make_col error_msg:string(error_message),
trace:trace_id,
timestamp:start_time
| sort desc(timestamp)
| limit 50
Output: Recent errors with messages and trace IDs for further investigation.
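The metrics half of this workflow is just Pattern 2 scoped to the service (a sketch; "target-service" is the same placeholder as above):
align options(bins: 1), errors:sum(m("span_error_count_5m"))
aggregate total_errors:sum(errors), group_by(service_name)
| filter service_name = "target-service"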
Use Case 4: Capacity Planning
Goal: Understand traffic patterns over 30 days.
Solution: Use metrics with daily buckets:
align 1d, rate:sum(m("span_call_count_5m"))
| aggregate requests_per_day:sum(rate), group_by(service_name)
Time range: Set to 30d in query parameters.
Output: Daily request volume for capacity analysis.
Best Practices
- Start with metrics for dashboards, alerts, and overview queries
- Use spans for investigation when you need details or filtering
- Keep time ranges short when querying spans (1h recommended, max 24h)
- Use time-series patterns (align 1h) for trending charts
- Use summary patterns (options(bins: 1)) for single statistics
- Combine metrics (requests + errors) in a single query for efficiency
- Filter spans early to improve performance (filter before make_col)
- Use fill to show services with zero errors in dashboards
Performance Considerations
Metrics queries:
- Consistent performance across time ranges (1h same speed as 30d)
- Ideal for: Dashboards, real-time monitoring, long time ranges
- Limitation: Fixed dimensions, no custom filtering
Span queries:
- Performance degrades with longer time ranges
- Ideal for: Short time windows (1h), detailed investigation, filtered analysis
- Limitation: Slower for high-volume services over long periods
Recommended approach:
- Dashboards: 100% metrics
- Alerts: Metrics
- Investigation: Metrics (overview) → Spans (details)
- Trace analysis: Spans only
Common Pitfalls
Pitfall 1: Using Spans for Long Time Ranges
❌ Wrong:
make_col svc:service_name
| statsby request_count:count(), group_by(svc)
With a 30-day time range, this is very slow.
✅ Correct:
align options(bins: 1), rate:sum(m("span_call_count_5m"))
aggregate total_requests:sum(rate), group_by(service_name)
Metrics handle 30 days efficiently.
Pitfall 2: Forgetting Time Units
❌ Wrong:
make_col dur_ms:float64(duration)/1000000
✅ Correct:
make_col dur_ms:duration / 1ms
Use OPAL duration units (see working-with-intervals skill).
Pitfall 3: Not Using fill for Zeros
❌ Wrong:
aggregate total_errors:sum(errors), group_by(service_name)
Services with zero errors won't appear.
✅ Correct:
aggregate total_errors:sum(errors), group_by(service_name)
fill total_errors:0
All services shown (important for dashboards).
Pitfall 4: Wrong Metric for Latency
❌ Wrong:
align options(bins: 1), dur:avg(m("span_duration_5m"))
Don't use avg() on pre-aggregated duration.
✅ Correct:
align options(bins: 1), combined:tdigest_combine(m_tdigest("span_*_duration_tdigest_5m"))
aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95)
Use tdigest metrics for percentiles.
Related Skills
- working-with-intervals - Understanding span duration and temporal queries
- aggregating-gauge-metrics - Request and error count metrics (gauge type)
- analyzing-tdigest-metrics - Latency percentile metrics (tdigest type)
- filtering-event-datasets - Filtering techniques applicable to spans
- time-series-analysis - Time-bucketed analysis with timechart (alternative to align)
Summary
APM in Observe uses the RED methodology:
- Rate: Request counts from the span_call_count_5m metric
- Errors: Error counts from the span_error_count_5m metric
- Duration: Latency percentiles from the span_*_duration_tdigest_5m metric
Key decision: Metrics for speed and overview, Spans for detail and investigation.
Workflow: Metrics (identify issues) → Spans (root cause analysis) → Traces (follow request flow).
Performance: Metrics scale to long time ranges, spans are best for short windows.
Last Updated: November 14, 2025
Version: 1.0
Tested With: Observe OPAL (OpenTelemetry/Span + ServiceExplorer Metrics)