---
name: trino-optimizer
description: Expert assistance for optimizing Trino query performance in Treasure Data. Use this skill when users need help with slow queries, memory issues, timeouts, or performance tuning. Focuses on partition pruning, column selection, output optimization, and common performance bottlenecks.
---
# Trino Query Optimizer

Expert assistance for optimizing Trino query performance and troubleshooting performance issues in Treasure Data environments.

## When to Use This Skill

Use this skill when:
- Queries are running too slowly or timing out
- Encountering memory limit errors
- Optimizing existing queries for better performance
- Debugging PAGE_TRANSPORT_TIMEOUT errors
- Reducing query costs
- Analyzing query execution plans
## Core Optimization Principles

### 1. Time-Based Partition Pruning (Most Important)
TD tables are partitioned by 1-hour buckets. Always filter on time for optimal performance.
Use `TD_INTERVAL` or `TD_TIME_RANGE`:

```sql
-- Good: Uses partition pruning
WHERE TD_INTERVAL(time, '-1d', 'JST')

-- Good: Explicit time range
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-31')

-- Bad: No time filter - scans the entire table!
WHERE user_id = 123  -- Missing time filter
```
Impact: Without time filters, queries scan the entire table instead of just relevant partitions, dramatically increasing execution time and cost.
### 2. Column Selection

TD uses a columnar storage format, so select only the columns you need.
```sql
-- Good: Select specific columns
SELECT user_id, event_type, time
FROM events
WHERE TD_INTERVAL(time, '-1d', 'JST')

-- Bad: SELECT * reads all columns
SELECT *  -- Slower and more expensive
FROM events
WHERE TD_INTERVAL(time, '-1d', 'JST')
```
Impact: Each additional column increases I/O. Reading 10 columns vs 50 columns can make a significant performance difference.
### 3. Output Optimization

Use the output method appropriate for your use case.
**CREATE TABLE AS (CTAS)** - roughly 5x faster than a plain SELECT:

```sql
-- Best for large result sets
CREATE TABLE analysis_results AS
SELECT
  TD_TIME_STRING(time, 'd!', 'JST') AS date,
  COUNT(*) AS events
FROM events
WHERE TD_INTERVAL(time, '-1M', 'JST')
GROUP BY 1
```
**INSERT INTO** - for appending to existing tables:

```sql
INSERT INTO daily_summary
SELECT
  TD_TIME_STRING(time, 'd!', 'JST') AS date,
  COUNT(*) AS events
FROM events
WHERE TD_INTERVAL(time, '-1d', 'JST')
GROUP BY 1
```
**SELECT with LIMIT** - for exploratory queries:

```sql
-- Good for testing/exploration
SELECT *
FROM events
WHERE TD_INTERVAL(time, '-1d', 'JST')
LIMIT 1000
```
Impact: CTAS skips JSON serialization and writes directly to partitioned tables, which is where the roughly 5x speedup comes from.
### 4. REGEXP_LIKE vs Multiple LIKE Clauses

Replace chained LIKE clauses with `REGEXP_LIKE` for better performance.
```sql
-- Bad: Multiple LIKE clauses (slow)
WHERE (
  column LIKE '%android%'
  OR column LIKE '%ios%'
  OR column LIKE '%mobile%'
)

-- Good: Single REGEXP_LIKE (fast)
WHERE REGEXP_LIKE(column, 'android|ios|mobile')
```
Impact: Trino's optimizer cannot efficiently evaluate many LIKE clauses chained with OR, but a single REGEXP_LIKE is evaluated in one pass over the string.
### 5. Approximate Functions

Use `APPROX_*` functions for large-scale aggregations.
```sql
-- Fast: Approximate distinct count
SELECT APPROX_DISTINCT(user_id) AS unique_users
FROM events
WHERE TD_INTERVAL(time, '-1M', 'JST')

-- Slow: Exact distinct count (memory intensive)
SELECT COUNT(DISTINCT user_id) AS unique_users
FROM events
WHERE TD_INTERVAL(time, '-1M', 'JST')
```
Available approximate functions:

- `APPROX_DISTINCT(column)` - approximate unique count
- `APPROX_PERCENTILE(column, percentile)` - approximate percentile (e.g., 0.95 for p95)
- `APPROX_SET(column)` - returns a HyperLogLog sketch for set operations

Impact: Approximate functions use the HyperLogLog algorithm, dramatically reducing memory usage at the cost of a standard error of roughly 2%.
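A sketch of `APPROX_PERCENTILE` in the same style; the `response_time_ms` column is hypothetical and stands in for any numeric metric:

```sql
-- p50, p95, and p99 in a single aggregation pass
-- (response_time_ms is an illustrative column name)
SELECT
  APPROX_PERCENTILE(response_time_ms, 0.50) AS p50,
  APPROX_PERCENTILE(response_time_ms, 0.95) AS p95,
  APPROX_PERCENTILE(response_time_ms, 0.99) AS p99
FROM events
WHERE TD_INTERVAL(time, '-1d', 'JST')
```

An exact percentile would require sorting or holding all values; the approximate version keeps only a fixed-size sketch per aggregation.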
### 6. JOIN Optimization

Put the larger table on the left side of the join. Trino builds an in-memory hash table from the right-hand (build) side, so the smaller table should go on the right.
```sql
-- Good: Large table on the left, small lookup table on the right
SELECT l.*, s.attribute
FROM large_table l
JOIN small_lookup s ON l.id = s.id
WHERE TD_INTERVAL(l.time, '-1d', 'JST')

-- Consider: If one table can be filtered down, reduce it in a subquery first
SELECT e.*
FROM events e
JOIN (
  SELECT user_id
  FROM premium_users
  WHERE subscription_status = 'active'
) p ON e.user_id = p.user_id
WHERE TD_INTERVAL(e.time, '-1d', 'JST')
```
Tips:
- Always include time filters on all tables in JOIN
- Consider using IN or EXISTS for single-column lookups
- Avoid FULL OUTER JOIN when possible (expensive)
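The IN/EXISTS tip above can be sketched as a semi-join; `premium_users` and its columns follow the earlier example:

```sql
-- EXISTS keeps only matching events without materializing join output
SELECT e.user_id, e.event_type, e.time
FROM events e
WHERE TD_INTERVAL(e.time, '-1d', 'JST')
  AND EXISTS (
    SELECT 1
    FROM premium_users p
    WHERE p.user_id = e.user_id
      AND p.subscription_status = 'active'
  )
```

Unlike a JOIN, EXISTS cannot duplicate event rows when the lookup table has multiple matches per user.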
### 7. GROUP BY Optimization

Use column positions in GROUP BY for complex expressions.
```sql
-- Good: Use column positions
SELECT
  TD_TIME_STRING(time, 'd!', 'JST') AS date,
  event_type,
  COUNT(*) AS cnt
FROM events
WHERE TD_INTERVAL(time, '-1M', 'JST')
GROUP BY 1, 2  -- Cleaner and less error-prone than repeating the expression

-- Works, but verbose:
-- GROUP BY TD_TIME_STRING(time, 'd!', 'JST'), event_type
```
### 8. Window Function Optimization

Partition window functions appropriately to reduce memory usage.
```sql
-- Good: Partition by a high-cardinality column
SELECT
  user_id,
  event_time,
  ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_time) AS seq
FROM events
WHERE TD_INTERVAL(time, '-1d', 'JST')

-- Be careful: A window over the entire dataset is memory intensive
SELECT
  event_time,
  ROW_NUMBER() OVER (ORDER BY event_time) AS global_seq
FROM events
WHERE TD_INTERVAL(time, '-1d', 'JST')
```
## Common Performance Issues

### Issue: Query Timeout
Symptoms:
- Query exceeds maximum execution time
- "Query exceeded maximum time" error
Solutions:
- Add or narrow time filters with TD_INTERVAL/TD_TIME_RANGE
- Reduce selected columns
- Use LIMIT for testing before full run
- Break into smaller queries with intermediate tables
- Use approximate functions instead of exact aggregations
Example fix:

```sql
-- Before: Times out
SELECT COUNT(DISTINCT user_id)
FROM events
WHERE event_type = 'click'

-- After: Much faster
SELECT APPROX_DISTINCT(user_id)
FROM events
WHERE TD_INTERVAL(time, '-1d', 'JST')
  AND event_type = 'click'
```
### Issue: Memory Limit Exceeded
Symptoms:
- "Query exceeded per-node memory limit"
- "Query exceeded distributed memory limit"
Solutions:
- Reduce data volume with time filters
- Use APPROX_DISTINCT instead of COUNT(DISTINCT)
- Reduce number of columns in SELECT
- Limit GROUP BY cardinality
- Optimize JOIN operations
- Process data in smaller time chunks
Example fix:

```sql
-- Before: Memory exceeded
SELECT
  user_id,
  COUNT(DISTINCT session_id),
  COUNT(DISTINCT page_url),
  COUNT(DISTINCT referrer)
FROM events
WHERE TD_INTERVAL(time, '-1M', 'JST')
GROUP BY user_id

-- After: Uses approximate functions
SELECT
  user_id,
  APPROX_DISTINCT(session_id) AS sessions,
  APPROX_DISTINCT(page_url) AS pages,
  APPROX_DISTINCT(referrer) AS referrers
FROM events
WHERE TD_INTERVAL(time, '-1M', 'JST')
GROUP BY user_id
```
### Issue: PAGE_TRANSPORT_TIMEOUT
Symptoms:
- Frequent PAGE_TRANSPORT_TIMEOUT errors
- Network-related query failures
Solutions:
- Narrow TD_TIME_RANGE to reduce data volume
- Reduce number of columns in SELECT
- Break large queries into smaller time ranges
- Use CTAS instead of SELECT for large results
Example fix:

```sql
-- Before: Transporting a full year of data at once
SELECT *
FROM events
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-12-31')

-- After: Fewer columns, processed one day at a time
SELECT user_id, event_type, time
FROM events
WHERE TD_INTERVAL(time, '-1d', 'JST')
```
### Issue: Slow Aggregations
Symptoms:
- COUNT(DISTINCT) taking very long
- Large GROUP BY queries slow
Solutions:
- Use APPROX_DISTINCT instead of COUNT(DISTINCT)
- Filter data with time range first
- Consider pre-aggregating with intermediate tables
- Reduce GROUP BY dimensions
Example fix:

```sql
-- Before: Slow exact count
SELECT
  TD_TIME_STRING(time, 'd!', 'JST') AS date,
  COUNT(DISTINCT user_id) AS dau
FROM events
WHERE TD_INTERVAL(time, '-1M', 'JST')
GROUP BY 1

-- After: Fast approximate count
SELECT
  TD_TIME_STRING(time, 'd!', 'JST') AS date,
  APPROX_DISTINCT(user_id) AS dau
FROM events
WHERE TD_INTERVAL(time, '-1M', 'JST')
GROUP BY 1
```
## Query Analysis Workflow

When optimizing a slow query, follow this workflow:

### Step 1: Add EXPLAIN
```sql
EXPLAIN
SELECT ...
FROM ...
WHERE ...
```
Look for:
- Full table scans (missing time filters)
- High cardinality GROUP BY
- Expensive JOINs
- Missing filters
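When the static plan is inconclusive, Trino's `EXPLAIN ANALYZE` executes the query and annotates each stage with actual CPU time, memory, and row counts; the query below is illustrative:

```sql
-- Runs the query and reports per-stage runtime statistics
EXPLAIN ANALYZE
SELECT event_type, COUNT(*) AS cnt
FROM events
WHERE TD_INTERVAL(time, '-1d', 'JST')
GROUP BY 1
```

Because it actually runs the query, use it on a narrowed time range first.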
### Step 2: Check Time Filters

```sql
-- Ensure a time filter exists and is as narrow as possible
WHERE TD_INTERVAL(time, '-1d', 'JST')
-- or:
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-31')
```
### Step 3: Reduce Column Count

```sql
-- Select only needed columns
SELECT user_id, event_type, time
-- Not SELECT *
```
### Step 4: Use Approximate Functions

```sql
-- Replace COUNT(DISTINCT x) with APPROX_DISTINCT(x)
-- Replace exact percentile calculations with APPROX_PERCENTILE(x, 0.95)
```
### Step 5: Test with LIMIT

```sql
-- Test logic on a small subset first
SELECT ...
FROM ...
WHERE TD_INTERVAL(time, '-1d', 'JST')
LIMIT 1000
```
### Step 6: Use CTAS for Large Results

```sql
-- For queries returning many rows
CREATE TABLE results AS
SELECT ...
```
## Advanced Optimization Techniques

### Pre-aggregation Strategy

For frequently-run queries, create pre-aggregated tables.
```sql
-- Daily aggregation job
CREATE TABLE daily_user_events AS
SELECT
  TD_TIME_STRING(time, 'd!', 'JST') AS date,
  user_id,
  COUNT(*) AS event_count,
  APPROX_DISTINCT(session_id) AS sessions
FROM events
WHERE TD_INTERVAL(time, '-1d', 'JST')
GROUP BY 1, 2

-- Fast queries on pre-aggregated data
SELECT
  date,
  SUM(event_count) AS total_events,
  SUM(sessions) AS total_sessions
FROM daily_user_events
WHERE date >= '2024-01-01'
GROUP BY 1
```
### Incremental Processing

Process large time ranges incrementally.
```sql
-- Instead of processing 1 year at once,
-- process day by day and INSERT INTO a result table

-- Day 1
INSERT INTO monthly_summary
SELECT ...
FROM events
WHERE TD_INTERVAL(time, '-1d', 'JST')

-- Day 2
INSERT INTO monthly_summary
SELECT ...
FROM events
WHERE TD_INTERVAL(time, '-2d/-1d', 'JST')

-- etc.
```
### Materialized Views Pattern

Create and maintain summary tables for common queries.
```sql
-- Create the summary table once
CREATE TABLE user_daily_summary AS
SELECT
  TD_TIME_STRING(time, 'd!', 'JST') AS date,
  user_id,
  COUNT(*) AS events,
  APPROX_DISTINCT(page_url) AS pages_visited
FROM events
WHERE TD_INTERVAL(time, '-30d', 'JST')
GROUP BY 1, 2

-- Query the summary (much faster)
SELECT
  date,
  AVG(events) AS avg_events_per_user,
  AVG(pages_visited) AS avg_pages_per_user
FROM user_daily_summary
GROUP BY 1
ORDER BY 1
```
## Best Practices Checklist

Before running a query, verify:
- Time filter added (TD_INTERVAL or TD_TIME_RANGE)
- Selecting specific columns (not SELECT *)
- Using APPROX_DISTINCT for unique counts
- Using REGEXP_LIKE instead of multiple LIKEs
- Tested with LIMIT on small dataset first
- Using CTAS for large result sets
- All joined tables have time filters
- GROUP BY uses reasonable cardinality
- Window functions partition appropriately
## Cost Optimization
Reducing query execution time also reduces cost:
- Time filters - Scan less data
- Column selection - Read less data
- Approximate functions - Use less memory
- CTAS - Avoid expensive JSON serialization
- Pre-aggregation - Query summary tables instead of raw data
## Resources
- Use EXPLAIN to analyze query plans
- Monitor query execution in TD Console
- Check query statistics for memory and time usage
- Consider Trino 423+ features for better performance