| name | trino-to-hive-migration |
| description | Expert guidance for migrating queries from Trino to Hive when encountering memory errors, timeouts, or performance issues. Use this skill when Trino queries fail with memory limits or when batch processing requirements make Hive a better choice. |
Trino to Hive Migration Guide
Expert assistance for converting Trino queries to Hive to resolve memory errors, timeouts, and resource constraints.
When to Use This Skill
Use this skill when:
- Trino queries fail with memory limit exceeded errors
- Queries time out in Trino despite optimization
- Processing very large datasets (months/years of data)
- Running batch ETL jobs that don't require interactive speed
- Encountering "Query exceeded per-node memory limit" errors
- Need more stable execution for complex aggregations
- Cost optimization for long-running queries
Why Hive Often Solves Memory Problems
Key Differences
Trino (Fast but Memory-Intensive):
- In-memory processing for speed
- Limited by node memory (per-node limits)
- Best for interactive queries on moderate data volumes
- Fast failures when memory exceeded
Hive with Tez (Slower but More Scalable):
- TD's Hive uses Apache Tez execution engine (not traditional MapReduce)
- Tez provides faster performance than classic MapReduce
- Disk-based processing with memory spilling
- Can handle much larger datasets
- Spills to disk when memory insufficient
- More fault-tolerant for large jobs
- Better for batch processing
When Hive is Better
✅ Use Hive when:
- Processing > 1 month of data at once
- Memory errors in Trino
- Complex multi-way JOINs
- High cardinality GROUP BY operations
- Batch ETL (scheduled daily/hourly jobs)
- Query doesn't need to be interactive
❌ Use Trino when:
- Interactive/ad-hoc queries
- Need results quickly (< 5 minutes)
- Small to moderate data volumes
- Real-time dashboards
Migration Workflow
Step 1: Identify the Problem
Trino memory error example:
Query exceeded per-node memory limit of 10GB
Query exceeded distributed memory limit of 100GB
When you see this:
- First try Trino optimization (see trino-optimizer skill)
- If optimization doesn't help, migrate to Hive
Step 2: Convert Query Syntax
Most Trino queries work in Hive with minor changes.
Syntax Conversion Guide
Basic SELECT (Usually Compatible)
-- Trino
SELECT
user_id,
COUNT(*) as event_count
FROM events
WHERE TD_INTERVAL(time, '-1M', 'JST')
GROUP BY user_id
-- Hive (same syntax)
SELECT
user_id,
COUNT(*) as event_count
FROM events
WHERE TD_INTERVAL(time, '-1M', 'JST')
GROUP BY user_id
Time Functions
TD_TIME_STRING (Trino only):
-- Trino
SELECT TD_TIME_STRING(time, 'd!', 'JST') as date
-- Hive: Use TD_TIME_FORMAT instead
SELECT TD_TIME_FORMAT(time, 'yyyy-MM-dd', 'JST') as date
TD_TIME_RANGE (Compatible):
-- Both Trino and Hive
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-31', 'JST')
TD_INTERVAL (Compatible):
-- Both Trino and Hive
WHERE TD_INTERVAL(time, '-1M', 'JST')
Approximate Functions
APPROX_DISTINCT (Compatible via Hivemall!):
-- Trino
SELECT APPROX_DISTINCT(user_id) as unique_users
-- Hive: SAME SYNTAX! approx_distinct is available via Hivemall
SELECT approx_distinct(user_id) as unique_users
-- Also available as: approx_count_distinct(user_id)
-- Uses HyperLogLog algorithm, same as Trino
-- Hive Option 2: Exact count (slower but accurate)
SELECT COUNT(DISTINCT user_id) as unique_users
-- Hive Option 3: Sample for estimation
SELECT COUNT(DISTINCT user_id) * 10 as estimated_users
FROM table_name
WHERE TD_INTERVAL(time, '-1M', 'JST')
AND rand() < 0.1 -- 10% sample
Good news: approx_distinct() works in both Trino and Hive! Hivemall provides this function (along with approx_count_distinct() as an alias) using the same HyperLogLog algorithm, so you often don't need to change this function when migrating.
APPROX_PERCENTILE (Trino) → PERCENTILE (Hive):
-- Trino
SELECT APPROX_PERCENTILE(response_time, 0.95) as p95
-- Hive
SELECT PERCENTILE(response_time, 0.95) as p95
Array and String Functions
ARRAY_AGG (Trino) → COLLECT_LIST (Hive):
-- Trino
SELECT ARRAY_AGG(product_id) as products
-- Hive
SELECT COLLECT_LIST(product_id) as products
STRING_AGG (Trino) → CONCAT_WS + COLLECT_LIST (Hive):
-- Trino
SELECT STRING_AGG(name, ', ') as names
-- Hive
SELECT CONCAT_WS(', ', COLLECT_LIST(name)) as names
SPLIT (Compatible but different behavior):
-- Trino: Returns array
SELECT SPLIT(text, ',')[1] -- 0-indexed
-- Hive: Returns array
SELECT SPLIT(text, ',')[0] -- Also 0-indexed (compatible)
REGEXP Functions
REGEXP_EXTRACT (Compatible):
-- Both Trino and Hive
SELECT REGEXP_EXTRACT(url, 'product_id=([0-9]+)', 1)
REGEXP_LIKE (Trino) → RLIKE (Hive):
-- Trino
WHERE REGEXP_LIKE(column, 'pattern')
-- Hive
WHERE column RLIKE 'pattern'
Window Functions (Mostly Compatible)
-- Both support similar syntax
SELECT
user_id,
event_time,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_time) as seq,
LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time) as prev_event
FROM events
WHERE TD_INTERVAL(time, '-1d', 'JST')
JOINs (Compatible)
MAPJOIN hint in Hive for small tables:
-- Trino (automatic small table optimization)
SELECT *
FROM large_table l
JOIN small_table s ON l.id = s.id
-- Hive (explicit hint for better performance)
SELECT /*+ MAPJOIN(small_table) */ *
FROM large_table l
JOIN small_table s ON l.id = s.id
CAST Functions (Mostly Compatible)
-- Both Trino and Hive
CAST(column AS BIGINT)
CAST(column AS VARCHAR)
CAST(column AS DOUBLE)
-- Trino: TRY_CAST (returns NULL on failure)
TRY_CAST(column AS BIGINT)
-- Hive: No TRY_CAST, use CAST with NULL handling
CAST(column AS BIGINT) -- Returns NULL on failure in Hive
Complete Migration Examples
Example 1: Memory Error with Large Aggregation
Original Trino query (fails with memory error):
SELECT
user_id,
TD_TIME_STRING(time, 'd!', 'JST') as date,
APPROX_DISTINCT(session_id) as sessions,
COUNT(*) as events
FROM events
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-12-31')
GROUP BY user_id, TD_TIME_STRING(time, 'd!', 'JST')
Converted to Hive (works):
SELECT
user_id,
TD_TIME_FORMAT(time, 'yyyy-MM-dd', 'JST') as date,
approx_distinct(session_id) as sessions, -- Hivemall - same as Trino!
COUNT(*) as events
FROM events
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-12-31', 'JST')
GROUP BY user_id, TD_TIME_FORMAT(time, 'yyyy-MM-dd', 'JST')
Example 2: Complex Multi-Way JOIN
Original Trino query (times out):
SELECT
e.user_id,
u.user_name,
p.product_name,
o.order_total
FROM events e
JOIN users u ON e.user_id = u.user_id
JOIN products p ON e.product_id = p.product_id
JOIN orders o ON e.order_id = o.order_id
WHERE TD_INTERVAL(e.time, '-3M', 'JST')
AND REGEXP_LIKE(e.event_type, 'purchase|checkout')
Converted to Hive (works):
SELECT
e.user_id,
u.user_name,
p.product_name,
o.order_total
FROM events e
JOIN users u ON e.user_id = u.user_id
JOIN products p ON e.product_id = p.product_id
JOIN orders o ON e.order_id = o.order_id
WHERE TD_INTERVAL(e.time, '-3M', 'JST')
AND e.event_type RLIKE 'purchase|checkout'
Example 3: High Cardinality GROUP BY
Original Trino query (memory exceeded):
SELECT
user_id,
session_id,
page_url,
referrer,
COUNT(*) as page_views
FROM page_views
WHERE TD_INTERVAL(time, '-6M', 'JST')
GROUP BY user_id, session_id, page_url, referrer
Converted to Hive (works):
-- Same syntax, but Hive handles high cardinality better
SELECT
user_id,
session_id,
page_url,
referrer,
COUNT(*) as page_views
FROM page_views
WHERE TD_INTERVAL(time, '-6M', 'JST')
GROUP BY user_id, session_id, page_url, referrer
Common Syntax Differences Summary
| Feature | Trino | Hive |
|---|---|---|
| Execution engine | In-memory | Tez (optimized MapReduce) |
| Time formatting | TD_TIME_STRING(time, 'd!', 'JST') |
TD_TIME_FORMAT(time, 'yyyy-MM-dd', 'JST') |
| Approximate distinct | APPROX_DISTINCT(col) |
approx_distinct(col) (Hivemall - compatible!) |
| Approximate percentile | APPROX_PERCENTILE(col, 0.95) |
PERCENTILE(col, 0.95) |
| Array aggregation | ARRAY_AGG(col) |
COLLECT_LIST(col) |
| String aggregation | STRING_AGG(col, ',') |
CONCAT_WS(',', COLLECT_LIST(col)) |
| Regex matching | REGEXP_LIKE(col, 'pattern') |
col RLIKE 'pattern' |
| Try cast | TRY_CAST(col AS type) |
CAST(col AS type) (returns NULL on failure) |
| Small table join hint | Automatic | /*+ MAPJOIN(table) */ |
Migration Checklist
Before converting from Trino to Hive:
- Confirm memory error or timeout in Trino
- Try Trino optimization first (time filters, column reduction, APPROX functions)
- Replace
TD_TIME_STRINGwithTD_TIME_FORMAT - Keep
approx_distinctas-is (compatible via Hivemall!) or useCOUNT(DISTINCT)for exact counts - Replace
REGEXP_LIKEwithRLIKE - Replace
ARRAY_AGGwithCOLLECT_LIST - Replace
STRING_AGGwithCONCAT_WS+COLLECT_LIST - Add
/*+ MAPJOIN */hints for small lookup tables - Test query on small time range first
- Verify results match expected output
Note: TD's Hive uses Apache Tez for execution (faster than classic MapReduce) and includes Hivemall library for machine learning and approximate functions.
Performance Tips for Hive
Once migrated to Hive, optimize with these techniques:
1. Use MAPJOIN for Small Tables
SELECT /*+ MAPJOIN(small_table) */ *
FROM large_table l
JOIN small_table s ON l.id = s.id
WHERE TD_INTERVAL(l.time, '-1M', 'JST')
2. Enable Dynamic Partitioning
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
INSERT OVERWRITE TABLE target_table PARTITION(dt)
SELECT *, TD_TIME_FORMAT(time, 'yyyy-MM-dd', 'JST') as dt
FROM source_table
WHERE TD_INTERVAL(time, '-1M', 'JST')
3. Process in Chunks for Very Large Datasets
-- Process 1 day at a time instead of entire year
INSERT INTO summary_table
SELECT ...
FROM events
WHERE TD_INTERVAL(time, '-1d', 'JST')
-- Run daily, much more stable than processing full year
4. Use LIMIT During Development
-- Test logic first
SELECT ...
FROM events
WHERE TD_INTERVAL(time, '-1d', 'JST')
LIMIT 1000
When Migration Doesn't Help
If Hive also fails or is too slow:
- Break into smaller time ranges: Process monthly instead of yearly
- Pre-aggregate data: Create intermediate summary tables
- Reduce dimensions: Fewer GROUP BY columns
- Sample data: Use
rand() < 0.1for 10% sample - Incremental processing: Process new data only, merge with historical
Switching Between Engines in TD
In TD Console:
- Change "Engine" dropdown from "Presto" to "Hive"
In Digdag Workflows:
# Trino
+query_trino:
td>: queries/analysis.sql
engine: presto # or trino
# Hive
+query_hive:
td>: queries/analysis.sql
engine: hive
In TD Toolbelt:
# Trino
td query -d database_name -T presto "SELECT ..."
# Hive
td query -d database_name -T hive "SELECT ..."
Best Practices
- Start with Trino: Try optimization first before switching to Hive
- Use Hive for batch jobs: Schedule heavy processing in Hive
- Keep interactive queries in Trino: Better for ad-hoc analysis
- Test conversions: Verify results match between engines
- Document engine choice: Note why Hive was needed (for future reference)
- Monitor execution: Check Hive job progress in TD Console
- Set expectations: Hive takes longer but handles larger data
Common Issues After Migration
Issue: Query Still Fails in Hive
Solutions:
- Reduce time range further
- Process incrementally (day by day)
- Reduce GROUP BY cardinality
- Use sampling for estimation
Issue: Hive Much Slower Than Expected
Solutions:
- Add MAPJOIN hints for small tables
- Ensure time filters are present
- Select only needed columns
- Check if Trino optimization would work after all
Issue: Results Different Between Trino and Hive
Check:
- Approximate vs exact functions
- NULL handling differences
- Timezone in time functions
- Array/string function behavior
Examples: Memory Error Resolution
Before (Trino - Memory Error):
Query exceeded per-node memory limit
After (Hive - Success):
-- Same query structure, just use Hive engine
-- Typically 2-5x slower but completes successfully
Typical Timeline:
- Trino: Fails after 5 minutes with memory error
- Hive: Completes in 20-30 minutes successfully
Resources
- TD Console: Switch between engines easily
- Check query logs to see memory usage patterns
- Use EXPLAIN in both engines to understand execution plans
- Monitor long-running Hive jobs in TD Console