---
name: trino-optimizer
description: Expert assistance for optimizing Trino query performance in Treasure Data. Use this skill when users need help with slow queries, memory issues, timeouts, or performance tuning. Focuses on partition pruning, column selection, output optimization, and common performance bottlenecks.
---
# Trino Query Optimizer

Expert assistance for optimizing Trino query performance and troubleshooting performance issues in Treasure Data environments.

## When to Use This Skill

Use this skill when:
- Queries are running too slowly or timing out
- Encountering memory limit errors
- Optimizing existing queries for better performance
- Debugging PAGE_TRANSPORT_TIMEOUT errors
- Reducing query costs
- Analyzing query execution plans
## Core Optimization Principles

### 1. Time-Based Partition Pruning (Most Important)
TD tables are partitioned by 1-hour buckets. Always filter on time for optimal performance.
Use `TD_INTERVAL` or `TD_TIME_RANGE`:

```sql
-- Good: Uses partition pruning
WHERE TD_INTERVAL(time, '-1d', 'JST')

-- Good: Explicit time range
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-31')

-- Bad: No time filter - scans the entire table!
WHERE user_id = 123  -- Missing time filter
```
Impact: Without time filters, queries scan the entire table instead of just relevant partitions, dramatically increasing execution time and cost.
### 2. Column Selection

TD uses a columnar storage format, so select only the columns you need.
```sql
-- Good: Select specific columns
SELECT user_id, event_type, time
FROM events
WHERE TD_INTERVAL(time, '-1d', 'JST')

-- Bad: SELECT * reads all columns
SELECT *  -- Slower and more expensive
FROM events
WHERE TD_INTERVAL(time, '-1d', 'JST')
```
Impact: Each additional column increases I/O. Reading 10 columns vs 50 columns can make a significant performance difference.
### 3. Output Optimization

Use the output method appropriate for your use case.
**CREATE TABLE AS (CTAS)** - roughly 5x faster than a plain SELECT:

```sql
-- Best for large result sets
CREATE TABLE analysis_results AS
SELECT
  TD_TIME_STRING(time, 'd!', 'JST') AS date,
  COUNT(*) AS events
FROM events
WHERE TD_INTERVAL(time, '-1M', 'JST')
GROUP BY 1
```
**INSERT INTO** - for appending to existing tables:

```sql
INSERT INTO daily_summary
SELECT
  TD_TIME_STRING(time, 'd!', 'JST') AS date,
  COUNT(*) AS events
FROM events
WHERE TD_INTERVAL(time, '-1d', 'JST')
GROUP BY 1
```
**SELECT with LIMIT** - for exploratory queries:

```sql
-- Good for testing/exploration
SELECT *
FROM events
WHERE TD_INTERVAL(time, '-1d', 'JST')
LIMIT 1000
```
Impact: CTAS skips JSON serialization and writes directly to partitioned tables, which is where the roughly 5x speedup comes from.
### 4. REGEXP_LIKE vs Multiple LIKE Clauses

Replace chained LIKE clauses with `REGEXP_LIKE` for better performance.
```sql
-- Bad: Multiple LIKE clauses (slow)
WHERE (
  column LIKE '%android%'
  OR column LIKE '%ios%'
  OR column LIKE '%mobile%'
)

-- Good: Single REGEXP_LIKE (fast)
WHERE REGEXP_LIKE(column, 'android|ios|mobile')
```
Impact: Trino's optimizer cannot efficiently evaluate many LIKE clauses chained with OR, but a single REGEXP_LIKE is evaluated in one pass over the string.
### 5. Approximate Functions

Use `APPROX_*` functions for large-scale aggregations.
```sql
-- Fast: Approximate distinct count
SELECT APPROX_DISTINCT(user_id) AS unique_users
FROM events
WHERE TD_INTERVAL(time, '-1M', 'JST')

-- Slow: Exact distinct count (memory intensive)
SELECT COUNT(DISTINCT user_id) AS unique_users
FROM events
WHERE TD_INTERVAL(time, '-1M', 'JST')
```
Available approximate functions:

- `APPROX_DISTINCT(column)` - approximate unique count
- `APPROX_PERCENTILE(column, percentile)` - approximate percentile (e.g., 0.95 for p95)
- `APPROX_SET(column)` - returns a HyperLogLog sketch for set operations

Impact: Approximate functions use the HyperLogLog algorithm, dramatically reducing memory usage at the cost of a standard error of roughly 2%.
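A sketch of `APPROX_PERCENTILE` in the same style; the `response_time_ms` column is hypothetical and stands in for any numeric metric:

```sql
-- p50, p95, and p99 in a single aggregation pass
-- (response_time_ms is an illustrative column name)
SELECT
  APPROX_PERCENTILE(response_time_ms, 0.50) AS p50,
  APPROX_PERCENTILE(response_time_ms, 0.95) AS p95,
  APPROX_PERCENTILE(response_time_ms, 0.99) AS p99
FROM events
WHERE TD_INTERVAL(time, '-1d', 'JST')
```

An exact percentile would require sorting or holding all values; the approximate version keeps only a fixed-size sketch per aggregation.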
### 6. JOIN Optimization

Put the larger table on the left side of the join. Trino builds an in-memory hash table from the right-hand (build) side, so the smaller table should go on the right.
```sql
-- Good: Large table on the left, small lookup table on the right
SELECT l.*, s.attribute
FROM large_table l
JOIN small_lookup s ON l.id = s.id
WHERE TD_INTERVAL(l.time, '-1d', 'JST')

-- Consider: If one table can be filtered down, reduce it in a subquery first
SELECT e.*
FROM events e
JOIN (
  SELECT user_id
  FROM premium_users
  WHERE subscription_status = 'active'
) p ON e.user_id = p.user_id
WHERE TD_INTERVAL(e.time, '-1d', 'JST')
```
Tips:
- Always include time filters on all tables in JOIN
- Consider using IN or EXISTS for single-column lookups
- Avoid FULL OUTER JOIN when possible (expensive)
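The IN/EXISTS tip above can be sketched as a semi-join; `premium_users` and its columns follow the earlier example:

```sql
-- EXISTS keeps only matching events without materializing join output
SELECT e.user_id, e.event_type, e.time
FROM events e
WHERE TD_INTERVAL(e.time, '-1d', 'JST')
  AND EXISTS (
    SELECT 1
    FROM premium_users p
    WHERE p.user_id = e.user_id
      AND p.subscription_status = 'active'
  )
```

Unlike a JOIN, EXISTS cannot duplicate event rows when the lookup table has multiple matches per user.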
### 7. GROUP BY Optimization

Use column positions in GROUP BY for complex expressions.
```sql
-- Good: Use column positions
SELECT
  TD_TIME_STRING(time, 'd!', 'JST') AS date,
  event_type,
  COUNT(*) AS cnt
FROM events
WHERE TD_INTERVAL(time, '-1M', 'JST')
GROUP BY 1, 2  -- Cleaner and less error-prone than repeating the expression

-- Works, but verbose:
-- GROUP BY TD_TIME_STRING(time, 'd!', 'JST'), event_type
```
### 8. Window Function Optimization

Partition window functions appropriately to reduce memory usage.
```sql
-- Good: Partition by a high-cardinality column
SELECT
  user_id,
  event_time,
  ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_time) AS seq
FROM events
WHERE TD_INTERVAL(time, '-1d', 'JST')

-- Be careful: A window over the entire dataset is memory intensive
SELECT
  event_time,
  ROW_NUMBER() OVER (ORDER BY event_time) AS global_seq
FROM events
WHERE TD_INTERVAL(time, '-1d', 'JST')
```
## Common Performance Issues

### Issue: Query Timeout
Symptoms:
- Query exceeds maximum execution time
- "Query exceeded maximum time" error
Solutions:
- Add or narrow time filters with TD_INTERVAL/TD_TIME_RANGE
- Reduce selected columns
- Use LIMIT for testing before full run
- Break into smaller queries with intermediate tables
- Use approximate functions instead of exact aggregations
Example fix:

```sql
-- Before: Times out
SELECT COUNT(DISTINCT user_id)
FROM events
WHERE event_type = 'click'

-- After: Much faster
SELECT APPROX_DISTINCT(user_id)
FROM events
WHERE TD_INTERVAL(time, '-1d', 'JST')
  AND event_type = 'click'
```
### Issue: Memory Limit Exceeded
Symptoms:
- "Query exceeded per-node memory limit"
- "Query exceeded distributed memory limit"
Solutions:
- Reduce data volume with time filters
- Use APPROX_DISTINCT instead of COUNT(DISTINCT)
- Reduce number of columns in SELECT
- Limit GROUP BY cardinality
- Optimize JOIN operations
- Process data in smaller time chunks
Example fix:

```sql
-- Before: Memory exceeded
SELECT
  user_id,
  COUNT(DISTINCT session_id),
  COUNT(DISTINCT page_url),
  COUNT(DISTINCT referrer)
FROM events
WHERE TD_INTERVAL(time, '-1M', 'JST')
GROUP BY user_id

-- After: Uses approximate functions
SELECT
  user_id,
  APPROX_DISTINCT(session_id) AS sessions,
  APPROX_DISTINCT(page_url) AS pages,
  APPROX_DISTINCT(referrer) AS referrers
FROM events
WHERE TD_INTERVAL(time, '-1M', 'JST')
GROUP BY user_id
```
### Issue: PAGE_TRANSPORT_TIMEOUT
Symptoms:
- Frequent PAGE_TRANSPORT_TIMEOUT errors
- Network-related query failures
Solutions:
- Narrow TD_TIME_RANGE to reduce data volume
- Reduce number of columns in SELECT
- Break large queries into smaller time ranges
- Use CTAS instead of SELECT for large results
Example fix:

```sql
-- Before: Transporting a full year of data at once
SELECT *
FROM events
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-12-31')

-- After: Fewer columns, processed one day at a time
SELECT user_id, event_type, time
FROM events
WHERE TD_INTERVAL(time, '-1d', 'JST')
```
### Issue: Slow Aggregations
Symptoms:
- COUNT(DISTINCT) taking very long
- Large GROUP BY queries slow
Solutions:
- Use APPROX_DISTINCT instead of COUNT(DISTINCT)
- Filter data with time range first
- Consider pre-aggregating with intermediate tables
- Reduce GROUP BY dimensions
Example fix:

```sql
-- Before: Slow exact count
SELECT
  TD_TIME_STRING(time, 'd!', 'JST') AS date,
  COUNT(DISTINCT user_id) AS dau
FROM events
WHERE TD_INTERVAL(time, '-1M', 'JST')
GROUP BY 1

-- After: Fast approximate count
SELECT
  TD_TIME_STRING(time, 'd!', 'JST') AS date,
  APPROX_DISTINCT(user_id) AS dau
FROM events
WHERE TD_INTERVAL(time, '-1M', 'JST')
GROUP BY 1
```
## Query Analysis Workflow

When optimizing a slow query, follow this workflow:

### Step 1: Add EXPLAIN
```sql
EXPLAIN
SELECT ...
FROM ...
WHERE ...
```
Look for:
- Full table scans (missing time filters)
- High cardinality GROUP BY
- Expensive JOINs
- Missing filters
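When the static plan is inconclusive, Trino's `EXPLAIN ANALYZE` executes the query and annotates each stage with actual CPU time, memory, and row counts; the query below is illustrative:

```sql
-- Runs the query and reports per-stage runtime statistics
EXPLAIN ANALYZE
SELECT event_type, COUNT(*) AS cnt
FROM events
WHERE TD_INTERVAL(time, '-1d', 'JST')
GROUP BY 1
```

Because it actually runs the query, use it on a narrowed time range first.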
### Step 2: Check Time Filters

```sql
-- Ensure a time filter exists and is as narrow as possible
WHERE TD_INTERVAL(time, '-1d', 'JST')
-- or:
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-31')
```
### Step 3: Reduce Column Count

```sql
-- Select only needed columns
SELECT user_id, event_type, time
-- Not SELECT *
```
### Step 4: Use Approximate Functions

```sql
-- Replace COUNT(DISTINCT x) with APPROX_DISTINCT(x)
-- Replace exact percentile calculations with APPROX_PERCENTILE(x, 0.95)
```
### Step 5: Test with LIMIT

```sql
-- Test logic on a small subset first
SELECT ...
FROM ...
WHERE TD_INTERVAL(time, '-1d', 'JST')
LIMIT 1000
```
### Step 6: Use CTAS for Large Results

```sql
-- For queries returning many rows
CREATE TABLE results AS
SELECT ...
```
## Advanced Optimization Techniques

### Pre-aggregation Strategy

For frequently-run queries, create pre-aggregated tables.
```sql
-- Daily aggregation job
CREATE TABLE daily_user_events AS
SELECT
  TD_TIME_STRING(time, 'd!', 'JST') AS date,
  user_id,
  COUNT(*) AS event_count,
  APPROX_DISTINCT(session_id) AS sessions
FROM events
WHERE TD_INTERVAL(time, '-1d', 'JST')
GROUP BY 1, 2

-- Fast queries on pre-aggregated data
SELECT
  date,
  SUM(event_count) AS total_events,
  SUM(sessions) AS total_sessions
FROM daily_user_events
WHERE date >= '2024-01-01'
GROUP BY 1
```
### Incremental Processing

Process large time ranges incrementally.
```sql
-- Instead of processing 1 year at once,
-- process day by day and INSERT INTO a result table

-- Day 1
INSERT INTO monthly_summary
SELECT ...
FROM events
WHERE TD_INTERVAL(time, '-1d', 'JST')

-- Day 2
INSERT INTO monthly_summary
SELECT ...
FROM events
WHERE TD_INTERVAL(time, '-2d/-1d', 'JST')

-- etc.
```
### Materialized Views Pattern

Create and maintain summary tables for common queries.
```sql
-- Create the summary table once
CREATE TABLE user_daily_summary AS
SELECT
  TD_TIME_STRING(time, 'd!', 'JST') AS date,
  user_id,
  COUNT(*) AS events,
  APPROX_DISTINCT(page_url) AS pages_visited
FROM events
WHERE TD_INTERVAL(time, '-30d', 'JST')
GROUP BY 1, 2

-- Query the summary (much faster)
SELECT
  date,
  AVG(events) AS avg_events_per_user,
  AVG(pages_visited) AS avg_pages_per_user
FROM user_daily_summary
GROUP BY 1
ORDER BY 1
```
## Best Practices Checklist

Before running a query, verify:
- Time filter added (TD_INTERVAL or TD_TIME_RANGE)
- Selecting specific columns (not SELECT *)
- Using APPROX_DISTINCT for unique counts
- Using REGEXP_LIKE instead of multiple LIKEs
- Tested with LIMIT on small dataset first
- Using CTAS for large result sets
- All joined tables have time filters
- GROUP BY uses reasonable cardinality
- Window functions partition appropriately
## Cost Optimization
Reducing query execution time also reduces cost:
- Time filters - Scan less data
- Column selection - Read less data
- Approximate functions - Use less memory
- CTAS - Avoid expensive JSON serialization
- Pre-aggregation - Query summary tables instead of raw data
## Resources
- Use EXPLAIN to analyze query plans
- Monitor query execution in TD Console
- Check query statistics for memory and time usage
- Consider Trino 423+ features for better performance