| name | analyzing-tdigest-metrics |
| description | Analyze percentile metrics (tdigest type) using OPAL for latency analysis and SLO tracking. Use when calculating p50, p95, p99 from pre-aggregated duration or latency metrics. Covers the critical double-combine pattern with align + m_tdigest() + tdigest_combine + aggregate. For simple metrics (counts, averages), see aggregating-gauge-metrics skill. |
Analyzing TDigest Metrics
TDigest metrics in Observe store pre-aggregated percentile data for efficient latency and duration analysis. This skill teaches the specialized pattern for querying tdigest metrics using OPAL.
When to Use This Skill
- Calculating latency percentiles (p50, p95, p99) for services or endpoints
- Analyzing request duration distributions
- Setting or tracking SLOs (Service Level Objectives) based on percentiles
- Understanding performance characteristics beyond simple averages
- Working with any metric of type `tdigest`
- When you need accurate percentile calculations from pre-aggregated data
Prerequisites
- Access to Observe tenant via MCP
- Understanding that tdigest metrics are pre-aggregated percentile structures
- Metric dataset with type `tdigest`
- Familiarity with percentiles (p50 = median, p95 = 95th percentile, etc.)
- Use `discover_context()` to find and inspect tdigest metrics
Key Concepts
What Are TDigest Metrics?
TDigest (t-digest) is a probabilistic data structure for estimating percentiles efficiently:
Pre-aggregated percentile data: Not raw values, but compressed statistical summaries
- Stores distribution information in compact form
- Enables accurate percentile calculations
- Much more efficient than storing all raw values
Why percentiles matter:
- Averages hide outliers: A service with avg 100ms might have p99 at 10 seconds
- SLOs use percentiles: "p95 latency < 500ms" is a common SLO target
- User experience: p95/p99 show what real users experience, not just average case
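A tiny Python sketch (synthetic data, not Observe output) makes the first point concrete: one slow request barely moves the mean but dominates p99.

```python
import statistics

# Synthetic latencies in ms: 99 fast requests plus one pathological outlier.
latencies = [100] * 99 + [10_000]

avg = statistics.mean(latencies)
p99 = statistics.quantiles(latencies, n=100)[98]  # 99th percentile cut point

print(f"avg={avg:.0f}ms p99={p99:.0f}ms")
```

The average stays near 200ms while p99 exposes the 10-second tail.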
Common Examples:
- `span_sn_service_node_duration_tdigest_5m` - Service-to-service latency percentiles
- `span_sn_service_edge_duration_tdigest_5m` - Edge latency percentiles
- `request_duration_tdigest_5m` - Request duration percentiles
- `database_query_duration_tdigest_5m` - Database query latency percentiles
CRITICAL: The Double-Combine Pattern
TDigest metrics require a special pattern that's different from gauge metrics:
# WRONG - Missing second combine ❌
align options(bins: 1), combined:tdigest_combine(m_tdigest("metric"))
aggregate p95:tdigest_quantile(combined, 0.95)
# CORRECT - Double-combine pattern ✅
align options(bins: 1), combined:tdigest_combine(m_tdigest("metric"))
aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95)
Why the double combine?
- First `tdigest_combine` (in `align`): Combines tdigest data points within time buckets
- Second `tdigest_combine` (in `aggregate`): Re-combines tdigests across groups/dimensions
- Then `tdigest_quantile`: Calculates the actual percentile value
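The two-stage merge can be sketched in Python. Here `tdigest_combine` and `tdigest_quantile` are simplified stand-ins (plain lists and a nearest-rank quantile), not the real compressed t-digest structures or the OPAL functions, but the merge-within-buckets-then-across-groups shape is the same:

```python
def tdigest_combine(digests):
    # Stand-in for merging t-digests: concatenate the underlying samples.
    merged = []
    for d in digests:
        merged.extend(d)
    return merged

def tdigest_quantile(digest, q):
    # Stand-in nearest-rank quantile; real t-digests interpolate centroids.
    data = sorted(digest)
    return data[min(int(q * len(data)), len(data) - 1)]

# First combine: merge the data points landing in the same time bucket.
bucket_a = tdigest_combine([[120, 130], [110, 500]])
bucket_b = tdigest_combine([[90, 95], [105, 4000]])

# Second combine: merge the per-bucket digests across buckets/groups,
# THEN take the quantile of the fully merged digest.
p95 = tdigest_quantile(tdigest_combine([bucket_a, bucket_b]), 0.95)
print(p95)  # 4000
```

Taking the quantile of a single bucket's digest would miss the other bucket's tail, which is why the second combine must happen before `tdigest_quantile`.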
Pattern breakdown:
align options(bins: 1),
combined:tdigest_combine(m_tdigest("metric_name")) ← First combine
aggregate p95:tdigest_quantile(
tdigest_combine(combined), ← Second combine (NESTED!)
0.95), ← Quantile value (0.0-1.0)
group_by(service_name)
Percentile Values
Percentiles are specified as decimal values from 0.0 to 1.0:
| Percentile | Value | Meaning |
|---|---|---|
| p50 (median) | 0.50 | 50% of values are below this |
| p75 | 0.75 | 75% of values are below this |
| p90 | 0.90 | 90% of values are below this |
| p95 | 0.95 | 95% of values are below this |
| p99 | 0.99 | 99% of values are below this |
| p99.9 | 0.999 | 99.9% of values are below this |
Common SLO targets: p95 < 500ms, p99 < 1000ms
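Since the quantile argument must be 0.0-1.0 (see Pitfall 4 below for the error this avoids), a small hypothetical helper shows the conversion and the range check:

```python
def pct_to_quantile(p):
    """Convert a percentile such as 95 or 99.9 into the 0.0-1.0 quantile value."""
    if not 0 < p < 100:
        raise ValueError("percentile must be strictly between 0 and 100")
    return p / 100

print(pct_to_quantile(95))  # 0.95
```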
Summary vs Time-Series (Same as Gauge Metrics)
| Output Type | Pattern | Result | Pipe? |
|---|---|---|---|
| Summary | `options(bins: 1)` | One row per group | NO |
| Time-Series | `5m`, `1h` | Many rows per group | YES |
Discovery Workflow
Step 1: Search for tdigest metrics
discover_context("duration tdigest", result_type="metric")
discover_context("latency percentile", result_type="metric")
Step 2: Get detailed metric schema
discover_context(metric_name="span_sn_service_node_duration_tdigest_5m")
Step 3: Verify metric type
Look for: Type: tdigest (critical!)
Step 4: Note available dimensions
Used for group_by():
- service_name, for_service_name
- environment, for_environment
- etc. (shown in discovery output)
Step 5: Write query
Use the double-combine pattern with the correct dimensions.
Basic Patterns
Pattern 1: Overall Percentiles (No Grouping)
Calculate percentiles across all data:
align options(bins: 1), combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
aggregate p50:tdigest_quantile(tdigest_combine(combined), 0.50),
p95:tdigest_quantile(tdigest_combine(combined), 0.95),
p99:tdigest_quantile(tdigest_combine(combined), 0.99)
Output: Single row with overall p50, p95, p99 across entire time range.
Note: Both combines present, no group_by.
Pattern 2: Percentiles Per Service
Calculate percentiles broken down by dimension:
align options(bins: 1), combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
aggregate p50:tdigest_quantile(tdigest_combine(combined), 0.50),
p95:tdigest_quantile(tdigest_combine(combined), 0.95),
p99:tdigest_quantile(tdigest_combine(combined), 0.99),
group_by(service_name)
Output: One row per service with percentiles.
Pattern 3: Single Percentile (Common for SLOs)
Get just p95 for SLO tracking:
align options(bins: 1), combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95),
group_by(service_name)
| sort desc(p95)
| limit 10
Output: Top 10 services by p95 latency.
Use case: Identify slowest services for optimization.
Pattern 4: Converting Units
TDigest values are often in nanoseconds - convert for readability:
align options(bins: 1), combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
aggregate p50_ns:tdigest_quantile(tdigest_combine(combined), 0.50),
p95_ns:tdigest_quantile(tdigest_combine(combined), 0.95),
p99_ns:tdigest_quantile(tdigest_combine(combined), 0.99),
group_by(service_name)
| make_col p50_ms:p50_ns / 1000000,
p95_ms:p95_ns / 1000000,
p99_ms:p99_ns / 1000000
Output: Percentiles in both nanoseconds and milliseconds.
Note: Check sample values in discover_context() to identify units.
Pattern 5: Time-Series Percentiles
Track percentiles over time buckets:
align 5m, combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
| aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95),
group_by(service_name)
Output: Multiple rows per service (one per 5-minute interval).
Note: Pipe | required for time-series pattern.
Use case: Dashboard charts showing latency trends over time.
Common Use Cases
SLO Tracking: p95 Latency Under Threshold
align options(bins: 1), combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
aggregate p95_ns:tdigest_quantile(tdigest_combine(combined), 0.95),
group_by(service_name)
| make_col p95_ms:p95_ns / 1000000
| make_col slo_target:500,
meets_slo:if(p95_ms < 500, "yes", "no")
| sort desc(p95_ms)
Use case: Check which services meet p95 < 500ms SLO target.
Output: Services with SLO compliance status.
Latency Distribution Analysis
align options(bins: 1), combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
aggregate p50:tdigest_quantile(tdigest_combine(combined), 0.50),
p75:tdigest_quantile(tdigest_combine(combined), 0.75),
p90:tdigest_quantile(tdigest_combine(combined), 0.90),
p95:tdigest_quantile(tdigest_combine(combined), 0.95),
p99:tdigest_quantile(tdigest_combine(combined), 0.99),
group_by(service_name)
| make_col p50_ms:p50 / 1000000,
p95_ms:p95 / 1000000,
p99_ms:p99 / 1000000
Use case: Understand full latency distribution to identify outliers.
Insight: Large gap between p95 and p99 indicates inconsistent performance.
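One rough heuristic for spotting that gap (an illustration only, not an Observe feature — the 3x threshold is arbitrary) is the p99/p95 ratio, computed here on values taken from the SLO table later in this document:

```python
def tail_ratio(p95_ms, p99_ms):
    # Ratio of tail latency to p95; a large value means an inconsistent tail.
    return p99_ms / p95_ms

# Example (p95_ms, p99_ms) pairs from this document's SLO compliance table.
services = {"frontend": (19373.5, 5641328.2), "currencyservice": (54.1, 125.1)}
for name, (p95, p99) in services.items():
    label = "long tail" if tail_ratio(p95, p99) > 3 else "consistent"
    print(name, round(tail_ratio(p95, p99), 1), label)
```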
Comparing Services by Latency
align options(bins: 1), combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95),
group_by(service_name)
| make_col p95_ms:p95 / 1000000
| sort desc(p95_ms)
| limit 10
Use case: Find slowest services to prioritize optimization efforts.
Time-Series for Incident Investigation
align 5m, combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
| aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95),
group_by(service_name)
| filter service_name = "frontend"
| make_col p95_ms:p95 / 1000000
Use case: See when latency spiked during an incident.
Output: Timeline of p95 latency for specific service.
Multi-Dimension Grouping
align options(bins: 1), combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95),
group_by(service_name, environment)
| make_col p95_ms:p95 / 1000000
| sort desc(p95_ms)
Use case: Compare latency across services AND environments.
Complete Example
Scenario: You're tracking SLOs for your microservices. The target is p95 latency < 500ms and p99 latency < 1000ms for all production services.
Step 1: Discover tdigest metrics
discover_context("duration tdigest", result_type="metric")
Found: span_sn_service_node_duration_tdigest_5m (type: tdigest)
Step 2: Get metric details
discover_context(metric_name="span_sn_service_node_duration_tdigest_5m")
Available dimensions: service_name, environment, for_service_name
Step 3: Query for SLO compliance
align options(bins: 1), combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
aggregate p95_ns:tdigest_quantile(tdigest_combine(combined), 0.95),
p99_ns:tdigest_quantile(tdigest_combine(combined), 0.99),
group_by(service_name, environment)
| make_col p95_ms:p95_ns / 1000000,
p99_ms:p99_ns / 1000000
| make_col p95_slo:if(p95_ms < 500, "✓", "✗"),
p99_slo:if(p99_ms < 1000, "✓", "✗")
| filter environment = "production"
| sort desc(p95_ms)
Step 4: Interpret results
| service_name | environment | p95_ms | p99_ms | p95_slo | p99_slo |
|---|---|---|---|---|---|
| frontend | production | 19373.5 | 5641328.2 | ✗ | ✗ |
| featureflagservice | production | 5838.8 | 7473.9 | ✗ | ✗ |
| cartservice | production | 4136.6 | 5898.3 | ✗ | ✗ |
| productcatalogservice | production | 257.0 | 313.1 | ✓ | ✓ |
| currencyservice | production | 54.1 | 125.1 | ✓ | ✓ |
Insight: Frontend, featureflagservice, and cartservice are violating SLOs - need optimization.
Step 5: Investigate frontend latency over time
align 1h, combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
| aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95),
p99:tdigest_quantile(tdigest_combine(combined), 0.99),
group_by(service_name)
| filter service_name = "frontend"
| make_col p95_ms:p95 / 1000000, p99_ms:p99 / 1000000
Output: Hourly p95/p99 trends to identify when latency degraded.
Common Pitfalls
Pitfall 1: Forgetting Second Combine
❌ Wrong (most common mistake):
align options(bins: 1), combined:tdigest_combine(m_tdigest("metric"))
aggregate p95:tdigest_quantile(combined, 0.95)
✅ Correct:
align options(bins: 1), combined:tdigest_combine(m_tdigest("metric"))
aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95)
Why: TDigest requires combining twice - once in align, once in aggregate.
Error message: "the field has to be aggregated or grouped"
Pitfall 2: Using m() Instead of m_tdigest()
❌ Wrong:
align options(bins: 1), combined:tdigest_combine(m("duration_tdigest_5m"))
✅ Correct:
align options(bins: 1), combined:tdigest_combine(m_tdigest("duration_tdigest_5m"))
Why: Tdigest metrics require m_tdigest() function, not m().
Check: Look for Type: tdigest in discover_context() output.
Pitfall 3: Wrong Pipe Usage (Same as Gauge)
❌ Wrong (pipe with bins:1):
align options(bins: 1), combined:tdigest_combine(m_tdigest("metric"))
| aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95)
✅ Correct:
# Summary - NO pipe
align options(bins: 1), combined:tdigest_combine(m_tdigest("metric"))
aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95)
# Time-series - YES pipe
align 5m, combined:tdigest_combine(m_tdigest("metric"))
| aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95)
Pitfall 4: Percentile Value Out of Range
❌ Wrong:
aggregate p95:tdigest_quantile(tdigest_combine(combined), 95)
✅ Correct:
aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95)
Why: Quantile values must be 0.0 to 1.0 (not 1 to 100).
Pitfall 5: Not Converting Units
❌ Wrong (values in nanoseconds, hard to read):
aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95)
Result: p95 = 14675991.25 (what unit is this?)
✅ Correct (convert to milliseconds):
aggregate p95_ns:tdigest_quantile(tdigest_combine(combined), 0.95)
| make_col p95_ms:p95_ns / 1000000
Result: p95_ms = 14.68 (clearly milliseconds)
Tip: Check sample values in discovery to identify units (19-digit numbers = nanoseconds).
Percentile Reference
Common percentiles and their meanings:
| Percentile | Decimal | Meaning | Common Use |
|---|---|---|---|
| p50 | 0.50 | Median (middle value) | Typical user experience |
| p75 | 0.75 | 75th percentile | Better than average case |
| p90 | 0.90 | 90th percentile | Catching most outliers |
| p95 | 0.95 | 95th percentile | Standard SLO target |
| p99 | 0.99 | 99th percentile | Tail latency / worst 1% |
| p99.9 | 0.999 | 99.9th percentile | Extreme outliers |
SLO best practice: Track p95 and p99, not just averages.
Unit Conversion Reference
Common time unit conversions (assuming nanoseconds):
# Nanoseconds to milliseconds (most common)
make_col value_ms:value_ns / 1000000
# Nanoseconds to seconds
make_col value_sec:value_ns / 1000000000
# Nanoseconds to microseconds
make_col value_us:value_ns / 1000
How to identify units: Check sample values in discover_context():
- 19 digits (1760201545280843522) = nanoseconds
- 13 digits (1758543367916) = milliseconds
- 10 digits (1758543367) = seconds
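The digit-count rule above can be wrapped in a small hypothetical helper (a heuristic only: it assumes epoch-style sample values and cannot distinguish, say, large microsecond counts):

```python
def guess_time_unit(value):
    # Heuristic based on the digit-count table above.
    digits = len(str(int(value)))
    if digits >= 19:
        return "nanoseconds"
    if digits >= 13:
        return "milliseconds"
    if digits >= 10:
        return "seconds"
    return "unknown"

print(guess_time_unit(1760201545280843522))  # nanoseconds
print(guess_time_unit(1758543367916))        # milliseconds
print(guess_time_unit(1758543367))           # seconds
```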
Best Practices
- Always use double-combine pattern - most critical rule for tdigest
- Verify metric type - must be `tdigest` (not `gauge`)
- Check units - convert nanoseconds to milliseconds for readability
- Use multiple percentiles - p50, p95, p99 show full distribution
- Calculate SLO compliance - add derived columns comparing to targets
- Sort and limit - focus on worst offenders with `sort desc() | limit 10`
- Use time-series for investigation - see when latency changed
- Group by relevant dimensions - service, environment, endpoint, etc.
Related Skills
- aggregating-gauge-metrics - For count/sum/avg metrics (NOT percentiles)
- working-with-intervals - For calculating percentiles from raw interval data (slower)
- time-series-analysis - For event/interval trending with timechart
Summary
TDigest metrics enable efficient percentile calculations:
- Core pattern: `align` + `m_tdigest()` + double `tdigest_combine` + `tdigest_quantile`
- Critical rule: Use `tdigest_combine()` TWICE (in align AND in aggregate)
- Metric function: `m_tdigest()` (NOT `m()`)
- Percentile values: 0.0 to 1.0 (0.95 = p95)
- Common percentiles: p50 (median), p95 (SLO), p99 (tail latency)
- Units: Often nanoseconds - convert to milliseconds for readability
Key distinction: TDigest metrics use special double-combine pattern, while gauge metrics use simple m() + aggregate.
Last Updated: November 14, 2025 Version: 1.0 Tested With: Observe OPAL (ServiceExplorer/Service Inspector Metrics)