SKILL.md

name: trino-to-hive-migration
description: Expert guidance for migrating queries from Trino to Hive when encountering memory errors, timeouts, or performance issues. Use this skill when Trino queries fail with memory limits or when batch processing requirements make Hive a better choice.

Trino to Hive Migration Guide

Expert assistance for converting Trino queries to Hive to resolve memory errors, timeouts, and resource constraints.

When to Use This Skill

Use this skill when:

  • Trino queries fail with memory-limit-exceeded errors
  • Queries time out in Trino despite optimization
  • You are processing very large datasets (months or years of data)
  • You are running batch ETL jobs that don't require interactive speed
  • You encounter "Query exceeded per-node memory limit" errors
  • You need more stable execution for complex aggregations
  • You want to reduce the cost of long-running queries

Why Hive Often Solves Memory Problems

Key Differences

Trino (Fast but Memory-Intensive):

  • In-memory processing for speed
  • Bounded by per-node and distributed memory limits
  • Best for interactive queries on moderate data volumes
  • Fails fast when memory limits are exceeded

Hive with Tez (Slower but More Scalable):

  • TD's Hive uses the Apache Tez execution engine (not traditional MapReduce)
  • Tez is faster than classic MapReduce
  • Disk-backed processing: spills to disk when memory is insufficient
  • Can handle much larger datasets
  • More fault-tolerant for large jobs
  • Better for batch processing

When Hive is Better

✅ Use Hive when:

  • Processing > 1 month of data at once
  • Memory errors in Trino
  • Complex multi-way JOINs
  • High cardinality GROUP BY operations
  • Batch ETL (scheduled daily/hourly jobs)
  • Query doesn't need to be interactive

❌ Use Trino when:

  • Interactive/ad-hoc queries
  • Need results quickly (< 5 minutes)
  • Small to moderate data volumes
  • Real-time dashboards

Migration Workflow

Step 1: Identify the Problem

Trino memory error example:

Query exceeded per-node memory limit of 10GB
Query exceeded distributed memory limit of 100GB

When you see this:

  1. First try Trino optimization (see the trino-optimizer skill); a sketch of typical first-pass fixes follows below
  2. If optimization doesn't help, migrate to Hive
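
A typical first pass in Trino means tightening the time filter, selecting only the columns you need, and switching to approximate aggregates. A minimal sketch of that first pass (table and column names are illustrative):

-- Trino: first-pass optimization before considering Hive
SELECT
  user_id,                                 -- select only needed columns
  APPROX_DISTINCT(session_id) as sessions  -- approximate instead of COUNT(DISTINCT)
FROM events
WHERE TD_INTERVAL(time, '-7d', 'JST')      -- narrow the time range as far as possible
GROUP BY user_id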

Step 2: Convert Query Syntax

Most Trino queries work in Hive with minor changes.

Syntax Conversion Guide

Basic SELECT (Usually Compatible)

-- Trino
SELECT
  user_id,
  COUNT(*) as event_count
FROM events
WHERE TD_INTERVAL(time, '-1M', 'JST')
GROUP BY user_id

-- Hive (same syntax)
SELECT
  user_id,
  COUNT(*) as event_count
FROM events
WHERE TD_INTERVAL(time, '-1M', 'JST')
GROUP BY user_id

Time Functions

TD_TIME_STRING (Trino only):

-- Trino
SELECT TD_TIME_STRING(time, 'd!', 'JST') as date

-- Hive: Use TD_TIME_FORMAT instead
SELECT TD_TIME_FORMAT(time, 'yyyy-MM-dd', 'JST') as date

TD_TIME_RANGE (Compatible):

-- Both Trino and Hive
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-31', 'JST')

TD_INTERVAL (Compatible):

-- Both Trino and Hive
WHERE TD_INTERVAL(time, '-1M', 'JST')

Approximate Functions

APPROX_DISTINCT (Compatible via Hivemall!):

-- Trino
SELECT APPROX_DISTINCT(user_id) as unique_users

-- Hive: SAME SYNTAX! approx_distinct is available via Hivemall
SELECT approx_distinct(user_id) as unique_users
-- Also available as: approx_count_distinct(user_id)
-- Uses HyperLogLog algorithm, same as Trino

-- Hive Option 2: Exact count (slower but accurate)
SELECT COUNT(DISTINCT user_id) as unique_users

-- Hive Option 3: Estimate from a ~10% sample of users
-- (sampling rows with rand() skews distinct counts; sampling whole users keeps the estimate honest)
SELECT COUNT(DISTINCT user_id) * 10 as estimated_users
FROM table_name
WHERE TD_INTERVAL(time, '-1M', 'JST')
  AND pmod(hash(user_id), 10) = 0  -- keep ~1 in 10 users

Good news: approx_distinct() works in both Trino and Hive! Hivemall provides this function (along with approx_count_distinct() as an alias) using the same HyperLogLog algorithm, so you often don't need to change this function when migrating.

APPROX_PERCENTILE (Trino) → PERCENTILE (Hive):

-- Trino
SELECT APPROX_PERCENTILE(response_time, 0.95) as p95

-- Hive
SELECT PERCENTILE(response_time, 0.95) as p95
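
One caveat: in standard Hive, PERCENTILE only accepts integral columns, while PERCENTILE_APPROX handles floating-point values. If response_time is a DOUBLE, the conversion would more likely be the following (worth verifying on TD's Hive):

-- Hive: approximate percentile for floating-point columns
SELECT PERCENTILE_APPROX(response_time, 0.95) as p95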

Array and String Functions

ARRAY_AGG (Trino) → COLLECT_LIST (Hive):

-- Trino
SELECT ARRAY_AGG(product_id) as products

-- Hive
SELECT COLLECT_LIST(product_id) as products

STRING_AGG (Trino) → CONCAT_WS + COLLECT_LIST (Hive):

-- Trino
SELECT STRING_AGG(name, ', ') as names

-- Hive
SELECT CONCAT_WS(', ', COLLECT_LIST(name)) as names

SPLIT (Available in both, but array indexing differs):

-- Trino: returns an array; arrays are 1-indexed
SELECT SPLIT(text, ',')[1]  -- first element

-- Hive: returns an array; arrays are 0-indexed
SELECT SPLIT(text, ',')[0]  -- first element

REGEXP Functions

REGEXP_EXTRACT (Compatible):

-- Both Trino and Hive
SELECT REGEXP_EXTRACT(url, 'product_id=([0-9]+)', 1)

REGEXP_LIKE (Trino) → RLIKE (Hive):

-- Trino
WHERE REGEXP_LIKE(column, 'pattern')

-- Hive
WHERE column RLIKE 'pattern'

Window Functions (Mostly Compatible)

-- Both support similar syntax
SELECT
  user_id,
  event_time,
  ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_time) as seq,
  LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time) as prev_event
FROM events
WHERE TD_INTERVAL(time, '-1d', 'JST')

JOINs (Compatible)

MAPJOIN hint in Hive for small tables:

-- Trino (automatic small table optimization)
SELECT *
FROM large_table l
JOIN small_table s ON l.id = s.id

-- Hive (explicit hint for better performance)
SELECT /*+ MAPJOIN(small_table) */ *
FROM large_table l
JOIN small_table s ON l.id = s.id

CAST Functions (Mostly Compatible)

-- Both Trino and Hive
CAST(column AS BIGINT)
CAST(column AS VARCHAR)
CAST(column AS DOUBLE)

-- Trino: TRY_CAST (returns NULL on failure)
TRY_CAST(column AS BIGINT)

-- Hive: no TRY_CAST; a plain CAST already returns NULL when the conversion fails
CAST(column AS BIGINT)
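
Because Hive's CAST silently yields NULL on bad input, it is common to pair it with COALESCE or an explicit NULL filter. A small sketch (transactions and amount are placeholder names):

-- Hive: two common ways to handle rows where the cast fails
SELECT user_id, COALESCE(CAST(amount AS BIGINT), 0) as amount_or_zero  -- default failed casts to 0
FROM transactions

-- or drop unparseable rows entirely
SELECT user_id, CAST(amount AS BIGINT) as amount
FROM transactions
WHERE CAST(amount AS BIGINT) IS NOT NULL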

Complete Migration Examples

Example 1: Memory Error with Large Aggregation

Original Trino query (fails with memory error):

SELECT
  user_id,
  TD_TIME_STRING(time, 'd!', 'JST') as date,
  APPROX_DISTINCT(session_id) as sessions,
  COUNT(*) as events
FROM events
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-12-31')
GROUP BY user_id, TD_TIME_STRING(time, 'd!', 'JST')

Converted to Hive (works):

SELECT
  user_id,
  TD_TIME_FORMAT(time, 'yyyy-MM-dd', 'JST') as date,
  approx_distinct(session_id) as sessions,  -- Hivemall - same as Trino!
  COUNT(*) as events
FROM events
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-12-31', 'JST')
GROUP BY user_id, TD_TIME_FORMAT(time, 'yyyy-MM-dd', 'JST')

Example 2: Complex Multi-Way JOIN

Original Trino query (times out):

SELECT
  e.user_id,
  u.user_name,
  p.product_name,
  o.order_total
FROM events e
JOIN users u ON e.user_id = u.user_id
JOIN products p ON e.product_id = p.product_id
JOIN orders o ON e.order_id = o.order_id
WHERE TD_INTERVAL(e.time, '-3M', 'JST')
  AND REGEXP_LIKE(e.event_type, 'purchase|checkout')

Converted to Hive (works):

SELECT
  e.user_id,
  u.user_name,
  p.product_name,
  o.order_total
FROM events e
JOIN users u ON e.user_id = u.user_id
JOIN products p ON e.product_id = p.product_id
JOIN orders o ON e.order_id = o.order_id
WHERE TD_INTERVAL(e.time, '-3M', 'JST')
  AND e.event_type RLIKE 'purchase|checkout'

Example 3: High Cardinality GROUP BY

Original Trino query (memory exceeded):

SELECT
  user_id,
  session_id,
  page_url,
  referrer,
  COUNT(*) as page_views
FROM page_views
WHERE TD_INTERVAL(time, '-6M', 'JST')
GROUP BY user_id, session_id, page_url, referrer

Converted to Hive (works):

-- Same syntax, but Hive handles high cardinality better
SELECT
  user_id,
  session_id,
  page_url,
  referrer,
  COUNT(*) as page_views
FROM page_views
WHERE TD_INTERVAL(time, '-6M', 'JST')
GROUP BY user_id, session_id, page_url, referrer

Common Syntax Differences Summary

| Feature | Trino | Hive |
|---------|-------|------|
| Execution engine | In-memory | Tez (DAG-based, disk-backed) |
| Time formatting | TD_TIME_STRING(time, 'd!', 'JST') | TD_TIME_FORMAT(time, 'yyyy-MM-dd', 'JST') |
| Approximate distinct | APPROX_DISTINCT(col) | approx_distinct(col) (Hivemall, compatible) |
| Approximate percentile | APPROX_PERCENTILE(col, 0.95) | PERCENTILE(col, 0.95) |
| Array aggregation | ARRAY_AGG(col) | COLLECT_LIST(col) |
| String aggregation | STRING_AGG(col, ',') | CONCAT_WS(',', COLLECT_LIST(col)) |
| Regex matching | REGEXP_LIKE(col, 'pattern') | col RLIKE 'pattern' |
| Try cast | TRY_CAST(col AS type) | CAST(col AS type) (returns NULL on failure) |
| Small-table join hint | Automatic | /*+ MAPJOIN(table) */ |

Migration Checklist

Before converting from Trino to Hive:

  • Confirm memory error or timeout in Trino
  • Try Trino optimization first (time filters, column reduction, APPROX functions)
  • Replace TD_TIME_STRING with TD_TIME_FORMAT
  • Keep approx_distinct as-is (compatible via Hivemall!) or use COUNT(DISTINCT) for exact counts
  • Replace REGEXP_LIKE with RLIKE
  • Replace ARRAY_AGG with COLLECT_LIST
  • Replace STRING_AGG with CONCAT_WS + COLLECT_LIST
  • Add /*+ MAPJOIN */ hints for small lookup tables
  • Test query on small time range first
  • Verify results match expected output

Note: TD's Hive uses Apache Tez for execution (faster than classic MapReduce) and includes the Hivemall library for machine learning and approximate functions.

Performance Tips for Hive

Once migrated to Hive, optimize with these techniques:

1. Use MAPJOIN for Small Tables

SELECT /*+ MAPJOIN(small_table) */ *
FROM large_table l
JOIN small_table s ON l.id = s.id
WHERE TD_INTERVAL(l.time, '-1M', 'JST')

2. Enable Dynamic Partitioning

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT OVERWRITE TABLE target_table PARTITION(dt)
SELECT *, TD_TIME_FORMAT(time, 'yyyy-MM-dd', 'JST') as dt  -- partition column must be last in the SELECT
FROM source_table
WHERE TD_INTERVAL(time, '-1M', 'JST')

3. Process in Chunks for Very Large Datasets

-- Process 1 day at a time instead of entire year
INSERT INTO summary_table
SELECT ...
FROM events
WHERE TD_INTERVAL(time, '-1d', 'JST')
-- Run daily, much more stable than processing full year

4. Use LIMIT During Development

-- Test logic first
SELECT ...
FROM events
WHERE TD_INTERVAL(time, '-1d', 'JST')
LIMIT 1000

When Migration Doesn't Help

If Hive also fails or is too slow:

  1. Break into smaller time ranges: Process monthly instead of yearly
  2. Pre-aggregate data: Create intermediate summary tables (see the sketch below)
  3. Reduce dimensions: Fewer GROUP BY columns
  4. Sample data: Use rand() < 0.1 for 10% sample
  5. Incremental processing: Process new data only, merge with historical
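
As a concrete example of pre-aggregation, a daily summary table absorbs the heavy GROUP BY once so that later queries scan far fewer rows. A sketch with a hypothetical summary table name (daily_user_summary):

-- Build a daily summary once (scheduled in Hive)
INSERT INTO daily_user_summary
SELECT
  TD_TIME_FORMAT(time, 'yyyy-MM-dd', 'JST') as dt,
  user_id,
  COUNT(*) as events,
  approx_distinct(session_id) as sessions
FROM events
WHERE TD_INTERVAL(time, '-1d', 'JST')
GROUP BY TD_TIME_FORMAT(time, 'yyyy-MM-dd', 'JST'), user_id

-- Later queries aggregate the small summary table instead of raw events
SELECT dt, COUNT(*) as active_users
FROM daily_user_summary
GROUP BY dt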

Switching Between Engines in TD

In TD Console:

  • Change "Engine" dropdown from "Presto" to "Hive"

In Digdag Workflows:

# Trino
+query_trino:
  td>: queries/analysis.sql
  engine: presto  # or trino

# Hive
+query_hive:
  td>: queries/analysis.sql
  engine: hive

In TD Toolbelt:

# Trino
td query -d database_name -T presto "SELECT ..."

# Hive
td query -d database_name -T hive "SELECT ..."

Best Practices

  1. Start with Trino: Try optimization first before switching to Hive
  2. Use Hive for batch jobs: Schedule heavy processing in Hive
  3. Keep interactive queries in Trino: Better for ad-hoc analysis
  4. Test conversions: Verify results match between engines
  5. Document engine choice: Note why Hive was needed (for future reference)
  6. Monitor execution: Check Hive job progress in TD Console
  7. Set expectations: Hive takes longer but handles larger data

Common Issues After Migration

Issue: Query Still Fails in Hive

Solutions:

  • Reduce time range further
  • Process incrementally (day by day)
  • Reduce GROUP BY cardinality
  • Use sampling for estimation

Issue: Hive Much Slower Than Expected

Solutions:

  • Add MAPJOIN hints for small tables
  • Ensure time filters are present
  • Select only needed columns
  • Check if Trino optimization would work after all

Issue: Results Different Between Trino and Hive

Check the following (a quick cross-engine comparison sketch follows this list):

  • Approximate vs exact functions
  • NULL handling differences
  • Timezone in time functions
  • Array/string function behavior
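
To localize a discrepancy, run the same tightly scoped aggregate in both engines and compare: exact counts should match, while approximate functions are expected to differ slightly. A sketch (run once with the Presto/Trino engine and once with Hive):

-- Run in both engines over the same small window and compare results
SELECT
  COUNT(*) as row_count,
  COUNT(DISTINCT user_id) as exact_users,   -- should match exactly
  approx_distinct(user_id) as approx_users  -- small differences are expected
FROM events
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-02', 'JST')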

Examples: Memory Error Resolution

Before (Trino - Memory Error):

Query exceeded per-node memory limit

After (Hive - Success):

-- Same query structure, just use Hive engine
-- Typically 2-5x slower but completes successfully

Typical Timeline:

  • Trino: Fails after 5 minutes with memory error
  • Hive: Completes in 20-30 minutes successfully

Resources

  • TD Console: Switch between engines easily
  • Check query logs to see memory usage patterns
  • Use EXPLAIN in both engines to understand execution plans (example below)
  • Monitor long-running Hive jobs in TD Console
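
EXPLAIN is simply prefixed to the statement in either engine; the plan formats differ, but both reveal the execution strategy. A minimal example:

-- Works in both Trino and Hive (plan output format differs)
EXPLAIN
SELECT user_id, COUNT(*) as events
FROM events
WHERE TD_INTERVAL(time, '-1d', 'JST')
GROUP BY user_id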