Claude Code Plugins

Community-maintained marketplace

Feedback

window-functions-deep-dive

@rustomax/observe-community-mcp
1
0

Master OPAL window functions for row-relative calculations, rankings, and moving aggregates. Covers lag(), lead(), row_number(), rank(), dense_rank(), moving averages, first(), and last(). Use when comparing rows to neighbors, ranking within partitions, calculating rate of change, or computing time-based moving windows. CRITICAL - OPAL uses window() function wrapper, NOT SQL OVER clause.

Install Skill

1Download skill
2Enable skills in Claude

Open claude.ai/settings/capabilities and find the "Skills" section

3Upload to Claude

Click "Upload skill" and select the downloaded ZIP file

Note: Please verify skill by going through its instructions before using it.

SKILL.md

name window-functions-deep-dive
description Master OPAL window functions for row-relative calculations, rankings, and moving aggregates. Covers lag(), lead(), row_number(), rank(), dense_rank(), moving averages, first(), and last(). Use when comparing rows to neighbors, ranking within partitions, calculating rate of change, or computing time-based moving windows. CRITICAL - OPAL uses window() function wrapper, NOT SQL OVER clause.

Window Functions Deep Dive

Overview

Window functions perform calculations across sets of rows related to the current row, without collapsing rows like aggregate does. They enable:

  • Row comparisons: Compare current row to previous/next rows (lag/lead)
  • Rankings: Assign ranks within partitions (row_number/rank/dense_rank)
  • Moving calculations: Rolling averages, sums over time windows
  • Boundary values: First/last values in windows or partitions

CRITICAL: OPAL vs SQL Syntax

❌ WRONG - SQL OVER Syntax (Does NOT Work):

SUM(val) OVER (PARTITION BY subject ORDER BY time ROWS UNBOUNDED PRECEDING)
# Error: Unknown function 'over()'

✅ CORRECT - OPAL window() Function:

window(sum(val), group_by(subject), order_by(time))

Key Difference: OPAL uses window() function wrapper, NOT SQL OVER clause. The Observe documentation sometimes shows SQL syntax for reference, but you MUST use the window() syntax in actual OPAL queries.


Window Function Categories

1. Offset Functions (lag/lead)

Access values from previous or next rows relative to current row.

Functions:

  • lag(column, offset) - Get value from N rows back
  • lead(column, offset) - Get value from N rows ahead

Syntax:

window(lag(column, N), group_by(partition_columns))
window(lead(column, N), group_by(partition_columns))

Key Points:

  • Offset defaults to 1 if not specified
  • Returns null when offset goes beyond partition boundary
  • MUST specify group_by() for partitioning (even if single partition)
  • Optional order_by() for explicit ordering (default: unspecified)

2. Ranking Functions

Assign ranks or row numbers within partitions.

Functions:

  • row_number() - Sequential unique numbers (1, 2, 3, 4, ...)
  • rank() - Ranks with gaps after ties (1, 2, 2, 4, ...)
  • dense_rank() - Ranks without gaps (1, 2, 2, 3, ...)

Syntax:

window(row_number(), group_by(partition), order_by(sort_col))
window(rank(), group_by(partition), order_by(sort_col))
window(dense_rank(), group_by(partition), order_by(sort_col))

Key Differences:

Value row_number() rank() dense_rank()
59 1 1 1
53 2 2 2
43 3 3 3
3 4 4 4
3 5 4 4
2 6 6 5

When to Use:

  • row_number(): Need unique IDs, pagination, sampling
  • rank(): Olympic-style ranking (ties share rank, gaps for fairness)
  • dense_rank(): Category ranking (no gaps, continuous numbering)

3. Aggregate Functions with Time Windows

Calculate rolling aggregates over time-based sliding windows.

Functions:

  • avg(column) - Moving average
  • sum(column) - Moving sum
  • min(column) - Moving minimum
  • max(column) - Moving maximum
  • count(column) - Moving count

Syntax:

window(aggregate_func(column), group_by(partition), order_by(time_col), frame(back:duration))

Frame Durations:

  • Minutes: 5m, 15m, 30m
  • Hours: 1h, 6h, 12h, 24h
  • Days: 1d, 7d, 30d

IMPORTANT: OPAL only supports time-based frames (back:duration), NOT row-based frames like SQL's ROWS N PRECEDING.


4. Value Functions (first/last)

Retrieve first or last value within partition or window.

Functions:

  • first(column) - First value (by order_by)
  • last(column) - Last value (by order_by)

Syntax:

# Entire partition
window(first(column), group_by(partition), order_by(sort_col))
window(last(column), group_by(partition), order_by(sort_col))

# Sliding window
window(first(column), group_by(partition), order_by(sort_col), frame(back:duration))
window(last(column), group_by(partition), order_by(sort_col), frame(back:duration))

Pattern 1: Rate of Change Detection (lag/lead)

Use Case: Detect sudden spikes or drops in metrics.

align 5m, rate:sum(m("span_call_count_5m"))
| make_col previous_rate:window(lag(rate, 1), group_by(service_name))
| make_col next_rate:window(lead(rate, 1), group_by(service_name))
| make_col rate_change:rate - previous_rate
| make_col pct_change:if(previous_rate > 0, (rate_change / previous_rate) * 100, 0)
| make_col is_spike:if(pct_change > 100 or pct_change < -50, true, false)
| filter is_spike = true
| topk 20, max(pct_change)

Result:

service_name  previous_rate  rate  pct_change  is_spike
frontend      2              50    2400%       true      # 24x increase!
frontend      50             2     -96%        true      # 96% drop
cartservice   10             80    700%        true      # 8x increase

How It Works:

  1. lag(rate, 1) gets previous time bucket's rate
  2. lead(rate, 1) gets next time bucket's rate (for forward detection)
  3. Calculate percentage change: (current - previous) / previous * 100
  4. Flag spikes: >100% increase or >50% decrease
  5. topk sorts by largest changes

When to Use:

  • Detect traffic spikes or drops
  • Identify service anomalies
  • Alert on sudden behavior changes
  • Compare bucket-to-bucket changes

Pattern 2: Top-N Per Time Bucket (Ranking)

Use Case: Find top 5 busiest services in each 5-minute window.

align 5m, rate:sum(m("span_call_count_5m"))
| aggregate total:sum(rate), group_by(service_name, _c_bucket)
| make_col rank:window(rank(), group_by(_c_bucket), order_by(desc(total)))
| filter rank <= 5
| sort asc(_c_bucket), asc(rank)

Result:

_c_bucket  service_name             total  rank
5877542    frontend-proxy           59     1
5877542    frontend                 53     2
5877542    featureflagservice       43     3
5877542    productcatalogservice    31     4
5877542    cartservice              8      5
5877543    frontend                 61     1
5877543    frontend-proxy           55     2
...

How It Works:

  1. Aggregate to get total per service per bucket
  2. rank() within each _c_bucket partition
  3. order_by(desc(total)) ranks highest first
  4. Filter to top 5
  5. Sort for readable output

When to Use:

  • Top-N queries within time windows
  • Identify busiest/slowest services per period
  • Compare rankings across time
  • Dashboard "top services" widgets

Variation - Use dense_rank() for No Gaps:

make_col dense_rank:window(dense_rank(), group_by(_c_bucket), order_by(desc(total)))

Use dense_rank() when ties should count as one rank (no gaps).


Pattern 3: Moving Average Baseline

Use Case: Detect anomalies by comparing current value to 30-minute moving average.

align 5m, rate:sum(m("span_call_count_5m"))
| make_col baseline:window(avg(rate), group_by(service_name), order_by(asc(valid_from)), frame(back:30m))
| make_col deviation:rate - baseline
| make_col pct_deviation:if(baseline > 0, (deviation / baseline) * 100, 0)
| make_col is_anomaly:if(pct_deviation > 50 or pct_deviation < -50, true, false)
| filter is_anomaly = true
| topk 20, max(pct_deviation)

Result:

service_name        rate  baseline  deviation  pct_deviation  is_anomaly
featureflagservice  150   30        120        400%           true
frontend            2     26        -24        -92%           true
cartservice         80    20        60         300%           true

How It Works:

  1. frame(back:30m) creates 30-minute sliding window
  2. avg(rate) calculates average within that window
  3. Compare current rate to baseline
  4. Flag deviations >50% above or below baseline

When to Use:

  • Anomaly detection with dynamic baseline
  • Smooth out noise in metrics
  • Detect sustained changes vs temporary blips
  • Performance monitoring (latency vs moving average)

Frame Duration Guidance:

  • back:15m - Very responsive, more false positives
  • back:30m - Balanced (recommended for most use cases)
  • back:1h - Stable baseline, less sensitive
  • back:24h - Day-over-day trending

Pattern 4: Detect Return to Baseline

Use Case: Alert when error rate exceeds initial rate by 2x for more than 1 hour.

align 5m, error_rate:sum(m("span_error_count_5m"))
| make_col first_rate:window(first(error_rate), group_by(service_name), order_by(asc(valid_from)), frame(back:1h))
| make_col current_elevated:if(error_rate > first_rate * 2, true, false)
| filter current_elevated = true
| statsby incidents:count(), avg_elevation:avg(error_rate / first_rate), group_by(service_name)
| sort desc(incidents)

Result:

service_name     incidents  avg_elevation
checkoutservice  15         3.2x          # Error rate 3.2x higher than 1h ago
paymentservice   8          2.8x
fraudservice     3          2.1x

How It Works:

  1. first(error_rate) with frame(back:1h) gets rate from 1 hour ago
  2. Compare current to first: error_rate > first_rate * 2
  3. Aggregate to count incidents and average elevation

When to Use:

  • Detect if metric returned to baseline
  • Track sustained elevations
  • SLO violations (errors above initial threshold)
  • Performance degradation alerts

Pattern 5: Smooth Noisy Metrics

Use Case: Smooth CPU usage with 10-minute moving average for charting.

align 1m, cpu:avg(m("cpu_usage_percent"))
| make_col smooth_cpu:window(avg(cpu), group_by(host), order_by(asc(valid_from)), frame(back:10m))
| pick_col valid_from, valid_to, host, cpu, smooth_cpu

Result:

valid_from     host       cpu    smooth_cpu
1763259000...  web-01     85.2   78.5       # Raw vs smoothed
1763259060...  web-01     92.1   79.2       # Spike smoothed out
1763259120...  web-01     76.4   79.8       # Averages over 10m

How It Works:

  1. Align to 1-minute buckets (high resolution)
  2. Calculate 10-minute moving average
  3. Use smooth_cpu for charting, cpu for alerts

When to Use:

  • Smooth spiky metrics for visualization
  • Reduce noise in dashboards
  • Calculate trends without losing detail
  • Separate signal from noise

Pattern 6: Consecutive Threshold Violations

Use Case: Alert only if latency exceeds SLO for 3 consecutive buckets.

align 5m, p95:percentile(duration_ms, 0.95)
| make_col prev1:window(lag(p95, 1), group_by(service_name))
| make_col prev2:window(lag(p95, 2), group_by(service_name))
| make_col all_high:if(p95 > 500 and prev1 > 500 and prev2 > 500, true, false)
| filter all_high = true
| statsby first_violation:min(valid_from), duration_mins:count() * 5, group_by(service_name)

Result:

service_name     first_violation       duration_mins
checkoutservice  2025-11-15 10:30:00   45            # Violated for 45 min
paymentservice   2025-11-15 11:15:00   20            # Violated for 20 min

How It Works:

  1. lag(p95, 1) and lag(p95, 2) get two previous buckets
  2. Check if current AND both previous exceed 500ms
  3. Aggregate to find first violation and total duration

When to Use:

  • Reduce alert fatigue from transient spikes
  • Require sustained violations before alerting
  • SLO monitoring with grace period
  • Stability checks (N consecutive failures)

Common Window Function Patterns

Quick Reference

Pattern Window Function Use Case
Bucket-to-bucket change lag(col, 1) Spike detection, rate of change
Forward prediction lead(col, 1) Pre-emptive alerts
Top-N per period rank(), dense_rank() Rankings, leaderboards
Unique row IDs row_number() Sampling, pagination
Moving average avg() + frame(back:30m) Smoothing, baseline
Rolling sum sum() + frame(back:1h) Hourly totals
Start of period first() + frame(back:24h) Day-over-day comparison
End of period last() + frame(back:24h) Final value in window

window() Syntax Reference

Complete Syntax

window(
  expression,                     # Function to apply
  group_by(partition_columns),    # Partition into groups
  order_by(sort_expression),      # Order within partitions
  frame(back:duration)            # Optional: time-based window
)

Parameters

expression (required):

  • Offset: lag(col, N), lead(col, N)
  • Ranking: row_number(), rank(), dense_rank()
  • Aggregate: avg(col), sum(col), min(col), max(col), count(col)
  • Value: first(col), last(col)

group_by() (required for most functions):

  • Partitions data into groups
  • Window function operates independently per partition
  • Example: group_by(service_name) - separate windows per service

order_by() (required for ranking, optional for others):

  • Defines row order within partitions
  • Example: order_by(asc(valid_from)) - chronological order
  • Example: order_by(desc(total)) - highest first

frame() (optional, only for aggregates and value functions):

  • Time-based sliding window: frame(back:duration)
  • Examples: frame(back:5m), frame(back:1h), frame(back:24h)
  • NOT supported: Row-based frames (ROWS N PRECEDING in SQL)

Limitations and Workarounds

❌ Limitation 1: No Row-Based Frames

Problem: Cannot specify "last 3 rows" or "next 5 rows".

SQL Syntax (Does NOT Work):

ROWS BETWEEN 3 PRECEDING AND CURRENT ROW
ROWS 5 FOLLOWING

OPAL Reality:

  • Only time-based frames: frame(back:duration)
  • No row-count-based frames

Workaround:

  • Use fixed time window: frame(back:15m) includes ~3 rows at 5m align
  • Use multiple lag() calls: lag(col, 1), lag(col, 2), lag(col, 3)

❌ Limitation 2: No Cumulative Sums

Problem: Cannot create running totals using window functions.

SQL Syntax (Does NOT Work):

SUM(val) OVER (ORDER BY time ROWS UNBOUNDED PRECEDING)

OPAL Behavior:

window(sum(val), order_by(time))  # Returns total, NOT cumulative

Why: Without frame, sums entire partition. With frame(back:duration), sums time window.

Workaround:

  • Use subqueries for cumulative calculations
  • Process outside OPAL (in application layer)
  • Request feature from Observe team

Example - Partial Workaround:

# NOT cumulative, but moving 1-hour sum
window(sum(val), group_by(service), order_by(time), frame(back:1h))

❌ Limitation 3: No OVER Clause

Problem: SQL OVER clause syntax fails in OPAL.

SQL Syntax (Does NOT Work):

SUM(val) OVER (PARTITION BY service ORDER BY time)
LAG(val, 1) OVER (PARTITION BY service)

Error: Unknown function 'over()'

Fix: Always use window() function wrapper.

Correct OPAL:

window(sum(val), group_by(service), order_by(time))
window(lag(val, 1), group_by(service))

Decision Tree: Which Window Function?

┌─────────────────────────────────────┐
│ What do you need to calculate?     │
└─────────────────────────────────────┘
           │
           ├─ Compare to previous/next row?
           │  └─> Use: lag() or lead()
           │      Examples:
           │      - Spike detection (current vs previous)
           │      - Rate of change (delta between buckets)
           │      - Consecutive violations (check last N via multiple lags)
           │
           ├─ Assign rankings or row numbers?
           │  └─> Use: row_number(), rank(), or dense_rank()
           │      Choose based on tie handling:
           │      - Unique IDs → row_number()
           │      - Olympic ranking (gaps after ties) → rank()
           │      - Continuous numbering (no gaps) → dense_rank()
           │
           ├─ Calculate over time window?
           │  └─> Use: avg/sum/min/max + frame(back:duration)
           │      Examples:
           │      - Moving average baseline
           │      - Rolling sum (hourly totals)
           │      - Smoothing spiky metrics
           │
           └─ Get first/last value in window?
              └─> Use: first() or last()
                  - Entire partition: omit frame()
                  - Sliding window: use frame(back:duration)
                  Examples:
                  - Compare to start of day (first in 24h window)
                  - Detect if returned to baseline

Performance Considerations

Window Function Performance

Fast ✅:

  • lag() and lead() with single offset
  • row_number(), rank(), dense_rank() within small partitions
  • Time-based frames with reasonable durations (<1h)

Slower ⚠️:

  • Large partitions (millions of rows per service)
  • Very long time windows (back:30d over large datasets)
  • Multiple window functions in single query

Optimization Tips:

  1. Filter Before Windowing:
# GOOD: Filter first, then window
filter service_name = "frontend"
| make_col prev:window(lag(rate, 1), group_by(service_name))

# SLOW: Window on all services, then filter
make_col prev:window(lag(rate, 1), group_by(service_name))
| filter service_name = "frontend"
  1. Use Appropriate Frame Duration:
# GOOD: 30m for 5m buckets = ~6 rows
frame(back:30m)

# SLOW: 24h for 5m buckets = ~288 rows
frame(back:24h)
  1. Limit Output Early:
# GOOD: Filter anomalies first
filter pct_change > 100
| limit 100

# SLOW: Calculate all, limit at end
make_col pct_change:...
| limit 100

Common Mistakes and Fixes

Mistake 1: Using SQL OVER Syntax

❌ WRONG:

make_col prev:lag(rate, 1) over (partition by service_name)

Error: Unknown function 'over()'

✅ FIX:

make_col prev:window(lag(rate, 1), group_by(service_name))

Mistake 2: Forgetting group_by()

❌ WRONG:

make_col prev:window(lag(rate, 1))

Error: Undefined behavior or incorrect partitioning

✅ FIX:

# Partition by service
make_col prev:window(lag(rate, 1), group_by(service_name))

# Single partition (all rows together)
make_col prev:window(lag(rate, 1), group_by())

Mistake 3: Expecting Cumulative Sum

❌ WRONG EXPECTATION:

window(sum(value), order_by(time))  # Expecting running total

Reality: Returns total sum of entire partition for ALL rows.

✅ WORKAROUND:

# Use moving sum instead (not cumulative, but windowed)
window(sum(value), group_by(service), order_by(time), frame(back:1h))

Mistake 4: Using Row-Based Frame Syntax

❌ WRONG:

frame(rows: 3 preceding)              # No rows parameter
frame(between: 3 preceding and current)  # No SQL syntax

✅ FIX:

frame(back:15m)  # Time-based only

Cross-References

Related Skills:

  • detecting-anomalies - Uses lag() for rate-of-change detection
  • aggregating-gauge-metrics - Metrics aggregation before windowing
  • time-series-analysis - Temporal analysis with timechart
  • subquery-patterns-and-union - Advanced multi-stage calculations

Common Workflows:

  1. Spike Detection: aggregating-gauge-metrics → window-functions (lag) → detecting-anomalies
  2. Top-N Dashboards: aggregating-gauge-metrics → window-functions (rank) → filtering
  3. Baseline Comparison: aggregating-gauge-metrics → window-functions (frame) → anomaly filtering
  4. Trend Analysis: time-series-analysis → window-functions (moving avg) → visualization

Examples Summary

Example 1: Rate of Change Detection

Use Case: Find services with >100% request rate increase

align 5m, rate:sum(m("span_call_count_5m"))
| make_col prev:window(lag(rate, 1), group_by(service_name))
| make_col pct_change:if(prev > 0, (rate - prev) / prev * 100, 0)
| filter pct_change > 100

Example 2: Top 10 Services Per Time Bucket

Use Case: Busiest services in each 5-minute window

align 5m, rate:sum(m("span_call_count_5m"))
| make_col rank:window(rank(), group_by(_c_bucket), order_by(desc(rate)))
| filter rank <= 10

Example 3: Anomaly Detection with Moving Baseline

Use Case: Alert when current exceeds 30m average by 50%

align 5m, rate:sum(m("span_call_count_5m"))
| make_col baseline:window(avg(rate), group_by(service_name), order_by(asc(valid_from)), frame(back:30m))
| make_col anomaly:if(rate > baseline * 1.5, true, false)
| filter anomaly = true

Example 4: Smoothed Metric for Dashboards

Use Case: 15-minute moving average for charts

align 1m, latency:percentile(duration_ms, 0.95)
| make_col smooth_p95:window(avg(latency), group_by(service_name), order_by(asc(valid_from)), frame(back:15m))
| pick_col valid_from, valid_to, service_name, smooth_p95

Example 5: Detect Sustained Violations

Use Case: Alert if SLO violated for 3+ consecutive buckets

align 5m, p95:percentile(duration_ms, 0.95)
| make_col prev1:window(lag(p95, 1), group_by(service_name))
| make_col prev2:window(lag(p95, 2), group_by(service_name))
| make_col sustained:if(p95 > 500 and prev1 > 500 and prev2 > 500, true, false)
| filter sustained = true

Quick Syntax Reference Card

Function Basic Syntax With Frame
lag window(lag(col, 1), group_by(dim)) N/A
lead window(lead(col, 1), group_by(dim)) N/A
row_number window(row_number(), group_by(dim), order_by(col)) N/A
rank window(rank(), group_by(dim), order_by(col)) N/A
dense_rank window(dense_rank(), group_by(dim), order_by(col)) N/A
avg window(avg(col), group_by(dim)) window(avg(col), group_by(dim), order_by(time), frame(back:30m))
sum window(sum(col), group_by(dim)) window(sum(col), group_by(dim), order_by(time), frame(back:1h))
first window(first(col), group_by(dim), order_by(time)) window(first(col), group_by(dim), order_by(time), frame(back:24h))
last window(last(col), group_by(dim), order_by(time)) window(last(col), group_by(dim), order_by(time), frame(back:24h))

Remember: ALWAYS use window() wrapper. NEVER use SQL OVER clause.