---
name: clickhouse
description: ClickHouse query optimization and schema design. Use when writing, optimizing, or reviewing ClickHouse queries and table schemas.
---
# ClickHouse Query Optimization Skill
Guidelines for optimal ClickHouse query performance and schema design. MUST rules are critical for performance; SHOULD rules are strong recommendations; CAN rules are optional optimizations.
## 1 — Columnar Storage Fundamentals
- COL-1 (MUST): Never use `SELECT *` — only select columns you need
- COL-2 (MUST): Understand that each column is stored as a separate file on disk
- COL-3 (SHOULD): Design queries to read minimal columns for maximum performance
Key insight: "The less data you read, the faster your queries run."
## 2 — Granules: The Fundamental Unit
- GR-1 (MUST): Understand granules are 8,192 rows by default (configurable)
- GR-2 (MUST): Know that granules are indivisible — if even one row in a granule matches, the entire granule of 8,192 rows is read
- GR-3 (SHOULD): Design primary keys to maximize granule skipping
Key insight: Sparse indexes store the first value of every 8,192nd row. ClickHouse performs binary search to identify matching granules, then reads only those granules.
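A sketch of where the granule size is set (the table and columns are hypothetical; 8,192 is the default and rarely needs changing):

```sql
CREATE TABLE events
(
    tenant_id  UInt32,
    timestamp  DateTime,
    event_type String
)
ENGINE = MergeTree
ORDER BY (tenant_id, timestamp)
SETTINGS index_granularity = 8192;  -- one sparse-index entry per 8,192 rows
```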
## 3 — Primary Key Design (ORDER BY)
This is the single most important optimization — proper ORDER BY selection can improve performance by 100× or more.
- PK-1 (MUST): Include columns that appear most frequently in WHERE clauses
- PK-2 (MUST): Order columns by cardinality — lower cardinality first
- PK-3 (MUST): Include time components for time-series workloads
- PK-4 (SHOULD): Avoid high-cardinality columns (UUIDs) as first column
### Cardinality Ordering Rationale
| Position | Search Algorithm | Cardinality Impact |
|---|---|---|
| First column | Binary search O(log₂ n) | Works efficiently regardless of cardinality |
| Secondary columns | Exclusion search | Effectiveness depends on predecessor cardinality |
Bad: `ORDER BY (user_uuid, timestamp)` — a UUID as the first column randomizes secondary column values across granules
Good: `ORDER BY (tenant_id, timestamp, user_uuid)` — low cardinality first enables effective pruning
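A sketch of the good pattern, with hypothetical table and column names:

```sql
CREATE TABLE user_events
(
    tenant_id  UInt32,    -- low cardinality: first in the key
    timestamp  DateTime,  -- time component for time-series pruning
    user_uuid  UUID,      -- high cardinality: last in the key
    event_type String
)
ENGINE = MergeTree
ORDER BY (tenant_id, timestamp, user_uuid);
```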
### Compound Key Query Support
A key of `(user_id, timestamp, event_type)` efficiently supports:

- `WHERE user_id = X` — highly efficient (uses first column)
- `WHERE user_id = X AND timestamp > Y` — efficient (uses first two columns)
- `WHERE timestamp > Y` — less efficient (skips first column)
## 4 — Query Optimization Techniques
### PREWHERE Clause
- PRE-1 (SHOULD): Use PREWHERE for highly selective filter conditions
- PRE-2 (SHOULD): Use PREWHERE when filter columns are smaller than SELECT columns
PREWHERE reads only filter condition columns first, evaluates the condition, then reads remaining columns only for matching rows.
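A sketch, assuming a hypothetical `logs` table where `status` is a small, highly selective column and `message` is large:

```sql
SELECT message
FROM logs
PREWHERE status = 500              -- only the status column is read first
WHERE event_date >= today() - 7;   -- remaining filters applied as usual
```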
### Approximate Functions
- APX-1 (SHOULD): Use `uniq()` instead of `COUNT(DISTINCT)` when approximate results are acceptable
- APX-2 (CAN): Use HyperLogLog-based functions for 10-100× performance improvement
Approximate functions use ~12 KB fixed memory vs O(n) for exact counts.
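For example (the `events` table is hypothetical):

```sql
-- Exact: keeps every distinct user_id in memory (O(n))
SELECT count(DISTINCT user_id) FROM events;

-- Approximate: fixed-size state with a small relative error
SELECT uniq(user_id) FROM events;
```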
### Hashing Long Strings
- HASH-1 (SHOULD): When grouping by long strings, hash them to fixed-size integers (5-10× faster)
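A sketch using ClickHouse's built-in `cityHash64()`, against a hypothetical `page_views` table:

```sql
SELECT
    cityHash64(url) AS url_hash,  -- fixed 8-byte group key instead of a long string
    any(url)        AS sample_url,
    count()         AS hits
FROM page_views
GROUP BY url_hash
ORDER BY hits DESC
LIMIT 10;
```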
## 5 — Data Type Optimization
- DT-1 (MUST): Avoid Nullable columns when possible (each requires an additional UInt8 null-mask column)
- DT-2 (MUST): Use `LowCardinality(String)` for columns with <10,000 unique values
- DT-3 (SHOULD): Choose smallest appropriate types (UInt8 vs Int64 saves 7 bytes per row)
### LowCardinality Benefits
- Dictionary encoding applied automatically
- Significant compression and query performance gains
- Ideal for status fields, country codes, category names
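A hypothetical schema contrasting these type choices:

```sql
CREATE TABLE orders
(
    order_id UInt64,
    status   LowCardinality(String),  -- dictionary-encoded: few unique values
    country  LowCardinality(String),  -- country codes: ideal candidate
    quantity UInt16,                  -- smallest integer type that fits the range
    note     String DEFAULT ''        -- empty-string default instead of Nullable(String)
)
ENGINE = MergeTree
ORDER BY order_id;
```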
## 6 — Denormalization
- DN-1 (SHOULD): Prefer denormalized data over JOINs
- DN-2 (SHOULD): Accept increased storage for eliminated JOIN overhead
Key insight: ClickHouse performs best with denormalized data. Columnar compression mitigates storage increase while eliminating hash table construction overhead.
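A sketch of the trade-off, assuming hypothetical `events` and `users` tables: the needed user attributes are copied onto each event row at insert time instead of joined at query time.

```sql
CREATE TABLE events_denorm
(
    event_id     UInt64,
    timestamp    DateTime,
    user_id      UInt64,
    user_country LowCardinality(String),  -- denormalized from users
    user_plan    LowCardinality(String)   -- denormalized from users
)
ENGINE = MergeTree
ORDER BY (user_country, timestamp);

-- No JOIN, no hash table construction:
SELECT user_plan, count()
FROM events_denorm
WHERE user_country = 'DE'
GROUP BY user_plan;
```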
## 7 — Advanced Optimizations
### Materialized Views
- MV-1 (SHOULD): Use materialized views for pre-computed aggregates
- MV-2 (CAN): Leverage incremental updates at insert time rather than query time
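A minimal sketch (table and column names hypothetical): the view aggregates rows as they are inserted into `events`, so queries can read the small rollup instead of the raw data.

```sql
CREATE MATERIALIZED VIEW daily_events_mv
ENGINE = SummingMergeTree
ORDER BY (tenant_id, day)
AS
SELECT
    tenant_id,
    toDate(timestamp) AS day,
    count() AS events  -- partial counts are summed on background merges
FROM events
GROUP BY tenant_id, day;
```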
### Projections
- PROJ-1 (CAN): Maintain multiple physical sort orders for different query patterns
- PROJ-2 (CAN): Use projections for pre-computed aggregates
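A sketch against a hypothetical `events` table whose primary key starts with `tenant_id`:

```sql
-- Store a second physical sort order inside the same table's parts
ALTER TABLE events ADD PROJECTION by_user
(
    SELECT * ORDER BY user_id
);

-- Build the projection for data that already exists
ALTER TABLE events MATERIALIZE PROJECTION by_user;
```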
### Skip Indexes
- SKIP-1 (SHOULD): Use skip indexes as last resort after primary key and projection optimization
- SKIP-2 (CAN): Consider bloom filters for high-cardinality column filtering
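For example (hypothetical table and column; the 0.01 false-positive rate and granularity are starting points, not tuned values):

```sql
ALTER TABLE events
    ADD INDEX user_uuid_bf user_uuid TYPE bloom_filter(0.01) GRANULARITY 4;

-- Build the index for existing parts
ALTER TABLE events MATERIALIZE INDEX user_uuid_bf;
```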
### Dictionaries
- DICT-1 (CAN): Use in-memory dictionaries for O(1) lookup times
- DICT-2 (CAN): Replace JOINs with dictionary lookups where applicable
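A sketch assuming a hypothetical `countries` source table with a numeric key:

```sql
CREATE DICTIONARY country_dict
(
    country_id UInt64,
    name       String
)
PRIMARY KEY country_id
SOURCE(CLICKHOUSE(TABLE 'countries'))
LAYOUT(HASHED())            -- whole dictionary held in memory
LIFETIME(MIN 300 MAX 600);  -- refresh every 5-10 minutes

-- O(1) lookup replaces a JOIN against countries
SELECT dictGet('country_dict', 'name', country_id) AS country, count()
FROM events
GROUP BY country;
```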
## 8 — Monitoring & Debugging
- MON-1 (MUST): Use `system.query_log` for query performance analysis
- MON-2 (MUST): Use `EXPLAIN indexes = 1` to verify granule selection
- MON-3 (SHOULD): Monitor granule ratio (granules selected / total granules)
- MON-4 (SHOULD): Measure, change one variable, measure again
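A starting-point query against `system.query_log` (the columns are standard; the one-hour window is illustrative):

```sql
SELECT
    query_duration_ms,
    read_rows,
    formatReadableSize(read_bytes)   AS data_read,
    formatReadableSize(memory_usage) AS peak_memory,
    query
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_time > now() - INTERVAL 1 HOUR
ORDER BY query_duration_ms DESC
LIMIT 10;
```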
### Key Metrics to Monitor
| Metric | Source | What to Look For |
|---|---|---|
| Rows read | query_log | Should be close to result set size |
| Granules selected | EXPLAIN | Lower ratio = better pruning |
| Memory usage | query_log | Watch for excessive allocation |
| Execution time | query_log | Track query patterns |
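To check granule pruning directly (the `events` table is hypothetical):

```sql
EXPLAIN indexes = 1
SELECT count()
FROM events
WHERE tenant_id = 42;
-- In the output, the PrimaryKey step reports "Granules: <selected>/<total>";
-- a low ratio means the sparse index is pruning effectively.
```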
## 9 — Query Execution
- EX-1 (SHOULD): Understand ClickHouse uses all CPU cores by default
- EX-2 (CAN): Each thread processes independent data ranges in parallel
- EX-3 (CAN): Use multi-pass grouping for large GROUP BY operations
### Execution Pipeline
1. Parsing and Analysis — Create execution plan
2. Optimization — Prune unnecessary data via sparse index
3. Pipeline Execution — Parallel processing across CPU cores
4. Final Processing — Merge results
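Both behaviors are tunable per query. A sketch with illustrative values, assuming a hypothetical `events` table; the external-aggregation setting is ClickHouse's mechanism for spilling large GROUP BY state to disk:

```sql
SELECT user_id, count() AS events
FROM events
GROUP BY user_id
SETTINGS
    max_threads = 8,                                   -- cap parallelism
    max_bytes_before_external_group_by = 10000000000;  -- spill GROUP BY state past ~10 GB
```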
## Performance Comparison Example
### Without Optimization (wrong ORDER BY)

```sql
-- ORDER BY (postcode, address), querying WHERE town = 'LONDON'
-- Rows processed: 27.64 million (full table scan)
-- Time: 44ms
-- Data read: 44.21 MB
```
### With Optimization (correct ORDER BY)

```sql
-- ORDER BY (town, price)
-- Rows processed: 81,920 (selected granules only)
-- Time: 5ms (8.8× faster)
-- Data read: 13.03 MB (3.4× reduction)
```
## Review Checklist
When reviewing ClickHouse schemas and queries:
- Column Selection: Are only necessary columns selected?
- ORDER BY Design: Is cardinality ordering correct (low to high)?
- Time Component: Does ORDER BY include timestamp for time-series data?
- Data Types: Are Nullable avoided? Is LowCardinality used appropriately?
- PREWHERE: Would PREWHERE benefit selective filters?
- Granule Efficiency: Does EXPLAIN show good granule skip ratio?
- Denormalization: Are JOINs avoided where possible?
- Approximate Functions: Could approximate functions replace exact counts?
## References
When creating queries, or when you need to know more about tables, consult:
- `references/clickhouse_tables.md` — Table schemas and query patterns
- `references/databases.md` — Connection details and cluster architecture
- `references/troubleshooting.md` — DDL queue, replication, and operational issues