---
name: clickhouse
description: ClickHouse query optimization and schema design. Use when writing, optimizing, or reviewing ClickHouse queries and table schemas.
---
# ClickHouse Query Optimization Skill
Guidelines for optimal ClickHouse query performance and schema design. MUST rules are critical for performance; SHOULD rules are strong recommendations; CAN rules are optional optimizations.
## 1 — Columnar Storage Fundamentals
- COL-1 (MUST): Never use `SELECT *` — only select columns you need
- COL-2 (MUST): Understand that each column is stored as a separate file on disk
- COL-3 (SHOULD): Design queries to read minimal columns for maximum performance
Key insight: "The less data you read, the faster your queries run."
## 2 — Granules: The Fundamental Unit
- GR-1 (MUST): Understand granules are 8,192 rows by default (configurable)
- GR-2 (MUST): Know that granules are indivisible — if even one row in a granule matches, the entire granule of 8,192 rows is read
- GR-3 (SHOULD): Design primary keys to maximize granule skipping
Key insight: Sparse indexes store the first value of every 8,192nd row. ClickHouse performs binary search to identify matching granules, then reads only those granules.
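A sketch of where the granule size is set (the table and columns are hypothetical; 8,192 is the default and rarely needs changing):

```sql
CREATE TABLE events
(
    tenant_id  UInt32,
    timestamp  DateTime,
    event_type String
)
ENGINE = MergeTree
ORDER BY (tenant_id, timestamp)
SETTINGS index_granularity = 8192;  -- one sparse-index entry per 8,192 rows
```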
## 3 — Primary Key Design (ORDER BY)
This is the single most important optimization — proper ORDER BY selection can improve performance by 100× or more.
- PK-1 (MUST): Include columns that appear most frequently in WHERE clauses
- PK-2 (MUST): Order columns by cardinality — lower cardinality first
- PK-3 (MUST): Include time components for time-series workloads
- PK-4 (SHOULD): Avoid high-cardinality columns (UUIDs) as first column
### Cardinality Ordering Rationale
| Position | Search Algorithm | Cardinality Impact |
|---|---|---|
| First column | Binary search O(log₂ n) | Works efficiently regardless of cardinality |
| Secondary columns | Exclusion search | Effectiveness depends on predecessor cardinality |
Bad: `ORDER BY (user_uuid, timestamp)` — a UUID as the first column randomizes secondary column values across granules
Good: `ORDER BY (tenant_id, timestamp, user_uuid)` — low cardinality first enables effective pruning
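A sketch of the good pattern, with hypothetical table and column names:

```sql
CREATE TABLE user_events
(
    tenant_id  UInt32,    -- low cardinality: first in the key
    timestamp  DateTime,  -- time component for time-series pruning
    user_uuid  UUID,      -- high cardinality: last in the key
    event_type String
)
ENGINE = MergeTree
ORDER BY (tenant_id, timestamp, user_uuid);
```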
### Compound Key Query Support
A key of `(user_id, timestamp, event_type)` efficiently supports:

- `WHERE user_id = X` — highly efficient (uses first column)
- `WHERE user_id = X AND timestamp > Y` — efficient (uses first two columns)
- `WHERE timestamp > Y` — less efficient (skips first column)
## 4 — Query Optimization Techniques
### PREWHERE Clause
- PRE-1 (SHOULD): Use PREWHERE for highly selective filter conditions
- PRE-2 (SHOULD): Use PREWHERE when filter columns are smaller than SELECT columns
PREWHERE reads only filter condition columns first, evaluates the condition, then reads remaining columns only for matching rows.
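A sketch, assuming a hypothetical `logs` table where `status` is a small, highly selective column and `message` is large:

```sql
SELECT message
FROM logs
PREWHERE status = 500              -- only the status column is read first
WHERE event_date >= today() - 7;   -- remaining filters applied as usual
```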
### Approximate Functions
- APX-1 (SHOULD): Use `uniq()` instead of `COUNT(DISTINCT)` when approximate results are acceptable
- APX-2 (CAN): Use HyperLogLog-based functions for 10-100× performance improvement
Approximate functions use ~12 KB fixed memory vs O(n) for exact counts.
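For example (the `events` table is hypothetical):

```sql
-- Exact: keeps every distinct user_id in memory (O(n))
SELECT count(DISTINCT user_id) FROM events;

-- Approximate: fixed-size state with a small relative error
SELECT uniq(user_id) FROM events;
```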
### Hashing Long Strings
- HASH-1 (SHOULD): When grouping by long strings, hash them to fixed-size integers (5-10× faster)
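A sketch using ClickHouse's built-in `cityHash64()`, against a hypothetical `page_views` table:

```sql
SELECT
    cityHash64(url) AS url_hash,  -- fixed 8-byte group key instead of a long string
    any(url)        AS sample_url,
    count()         AS hits
FROM page_views
GROUP BY url_hash
ORDER BY hits DESC
LIMIT 10;
```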
## 5 — Data Type Optimization
- DT-1 (MUST): Avoid Nullable columns when possible (each requires an additional UInt8 null-mask column)
- DT-2 (MUST): Use `LowCardinality(String)` for columns with <10,000 unique values
- DT-3 (SHOULD): Choose smallest appropriate types (UInt8 vs Int64 saves 7 bytes per row)
### LowCardinality Benefits
- Dictionary encoding applied automatically
- Significant compression and query performance gains
- Ideal for status fields, country codes, category names
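A hypothetical schema contrasting these type choices:

```sql
CREATE TABLE orders
(
    order_id UInt64,
    status   LowCardinality(String),  -- dictionary-encoded: few unique values
    country  LowCardinality(String),  -- country codes: ideal candidate
    quantity UInt16,                  -- smallest integer type that fits the range
    note     String DEFAULT ''        -- empty-string default instead of Nullable(String)
)
ENGINE = MergeTree
ORDER BY order_id;
```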
## 6 — Denormalization
- DN-1 (SHOULD): Prefer denormalized data over JOINs
- DN-2 (SHOULD): Accept increased storage for eliminated JOIN overhead
Key insight: ClickHouse performs best with denormalized data. Columnar compression mitigates storage increase while eliminating hash table construction overhead.
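A sketch of the trade-off, assuming hypothetical `events` and `users` tables: the needed user attributes are copied onto each event row at insert time instead of joined at query time.

```sql
CREATE TABLE events_denorm
(
    event_id     UInt64,
    timestamp    DateTime,
    user_id      UInt64,
    user_country LowCardinality(String),  -- denormalized from users
    user_plan    LowCardinality(String)   -- denormalized from users
)
ENGINE = MergeTree
ORDER BY (user_country, timestamp);

-- No JOIN, no hash table construction:
SELECT user_plan, count()
FROM events_denorm
WHERE user_country = 'DE'
GROUP BY user_plan;
```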
## 7 — Advanced Optimizations
### Materialized Views
- MV-1 (SHOULD): Use materialized views for pre-computed aggregates
- MV-2 (CAN): Leverage incremental updates at insert time rather than query time
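A minimal sketch (table and column names hypothetical): the view aggregates rows as they are inserted into `events`, so queries can read the small rollup instead of the raw data.

```sql
CREATE MATERIALIZED VIEW daily_events_mv
ENGINE = SummingMergeTree
ORDER BY (tenant_id, day)
AS
SELECT
    tenant_id,
    toDate(timestamp) AS day,
    count() AS events  -- partial counts are summed on background merges
FROM events
GROUP BY tenant_id, day;
```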
### Projections
- PROJ-1 (CAN): Maintain multiple physical sort orders for different query patterns
- PROJ-2 (CAN): Use projections for pre-computed aggregates
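A sketch against a hypothetical `events` table whose primary key starts with `tenant_id`:

```sql
-- Store a second physical sort order inside the same table's parts
ALTER TABLE events ADD PROJECTION by_user
(
    SELECT * ORDER BY user_id
);

-- Build the projection for data that already exists
ALTER TABLE events MATERIALIZE PROJECTION by_user;
```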
### Skip Indexes
- SKIP-1 (SHOULD): Use skip indexes as last resort after primary key and projection optimization
- SKIP-2 (CAN): Consider bloom filters for high-cardinality column filtering
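For example (hypothetical table and column; the 0.01 false-positive rate and granularity are starting points, not tuned values):

```sql
ALTER TABLE events
    ADD INDEX user_uuid_bf user_uuid TYPE bloom_filter(0.01) GRANULARITY 4;

-- Build the index for existing parts
ALTER TABLE events MATERIALIZE INDEX user_uuid_bf;
```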
### Dictionaries
- DICT-1 (CAN): Use in-memory dictionaries for O(1) lookup times
- DICT-2 (CAN): Replace JOINs with dictionary lookups where applicable
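A sketch assuming a hypothetical `countries` source table with a numeric key:

```sql
CREATE DICTIONARY country_dict
(
    country_id UInt64,
    name       String
)
PRIMARY KEY country_id
SOURCE(CLICKHOUSE(TABLE 'countries'))
LAYOUT(HASHED())            -- whole dictionary held in memory
LIFETIME(MIN 300 MAX 600);  -- refresh every 5-10 minutes

-- O(1) lookup replaces a JOIN against countries
SELECT dictGet('country_dict', 'name', country_id) AS country, count()
FROM events
GROUP BY country;
```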
## 8 — Monitoring & Debugging
- MON-1 (MUST): Use `system.query_log` for query performance analysis
- MON-2 (MUST): Use `EXPLAIN indexes = 1` to verify granule selection
- MON-3 (SHOULD): Monitor granule ratio (granules selected / total granules)
- MON-4 (SHOULD): Measure, change one variable, measure again
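A starting-point query against `system.query_log` (the columns are standard; the one-hour window is illustrative):

```sql
SELECT
    query_duration_ms,
    read_rows,
    formatReadableSize(read_bytes)   AS data_read,
    formatReadableSize(memory_usage) AS peak_memory,
    query
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_time > now() - INTERVAL 1 HOUR
ORDER BY query_duration_ms DESC
LIMIT 10;
```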
### Key Metrics to Monitor
| Metric | Source | What to Look For |
|---|---|---|
| Rows read | query_log | Should be close to result set size |
| Granules selected | EXPLAIN | Lower ratio = better pruning |
| Memory usage | query_log | Watch for excessive allocation |
| Execution time | query_log | Track query patterns |
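To check granule pruning directly (the `events` table is hypothetical):

```sql
EXPLAIN indexes = 1
SELECT count()
FROM events
WHERE tenant_id = 42;
-- In the output, the PrimaryKey step reports "Granules: <selected>/<total>";
-- a low ratio means the sparse index is pruning effectively.
```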
## 9 — Query Execution
- EX-1 (SHOULD): Understand ClickHouse uses all CPU cores by default
- EX-2 (CAN): Each thread processes independent data ranges in parallel
- EX-3 (CAN): Use multi-pass grouping for large GROUP BY operations
### Execution Pipeline
1. Parsing and Analysis — Create execution plan
2. Optimization — Prune unnecessary data via sparse index
3. Pipeline Execution — Parallel processing across CPU cores
4. Final Processing — Merge results
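Both behaviors are tunable per query. A sketch with illustrative values, assuming a hypothetical `events` table; the external-aggregation setting is ClickHouse's mechanism for spilling large GROUP BY state to disk:

```sql
SELECT user_id, count() AS events
FROM events
GROUP BY user_id
SETTINGS
    max_threads = 8,                                   -- cap parallelism
    max_bytes_before_external_group_by = 10000000000;  -- spill GROUP BY state past ~10 GB
```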
## Performance Comparison Example
### Without Optimization (wrong ORDER BY)

```sql
-- ORDER BY (postcode, address), querying WHERE town = 'LONDON'
-- Rows processed: 27.64 million (full table scan)
-- Time: 44ms
-- Data read: 44.21 MB
```
### With Optimization (correct ORDER BY)

```sql
-- ORDER BY (town, price)
-- Rows processed: 81,920 (selected granules only)
-- Time: 5ms (8.8× faster)
-- Data read: 13.03 MB (3.4× reduction)
```
## Review Checklist
When reviewing ClickHouse schemas and queries:
- Column Selection: Are only necessary columns selected?
- ORDER BY Design: Is cardinality ordering correct (low to high)?
- Time Component: Does ORDER BY include timestamp for time-series data?
- Data Types: Are Nullable avoided? Is LowCardinality used appropriately?
- PREWHERE: Would PREWHERE benefit selective filters?
- Granule Efficiency: Does EXPLAIN show good granule skip ratio?
- Denormalization: Are JOINs avoided where possible?
- Approximate Functions: Could approximate functions replace exact counts?
## References
When creating queries, or when you need to know more about tables, consult:
- `references/clickhouse_tables.md` — Table schemas and query patterns
- `references/databases.md` — Connection details and cluster architecture
- `references/troubleshooting.md` — DDL queue, replication, and operational issues