Claude Code Plugins

Community-maintained marketplace


ClickHouse query optimization and schema design. Use when writing, optimizing, or reviewing ClickHouse queries and table schemas.

Install Skill

1. Download skill
2. Enable skills in Claude: open claude.ai/settings/capabilities and find the "Skills" section
3. Upload to Claude: click "Upload skill" and select the downloaded ZIP file

Note: Please verify the skill by reading through its instructions before using it.

SKILL.md

name: clickhouse
description: ClickHouse query optimization and schema design. Use when writing, optimizing, or reviewing ClickHouse queries and table schemas.

ClickHouse Query Optimization Skill

Guidelines for optimal ClickHouse query performance and schema design. MUST rules are critical for performance; SHOULD rules are strong recommendations; CAN rules are optional optimizations.

1 — Columnar Storage Fundamentals

  • COL-1 (MUST): Never use SELECT * — only select columns you need
  • COL-2 (MUST): Understand that each column is stored as a separate file on disk
  • COL-3 (SHOULD): Design queries to read minimal columns for maximum performance

Key insight: "The less data you read, the faster your queries run."
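A minimal sketch of the difference, assuming a hypothetical `events` table:

```sql
-- Reads every column file in the table: avoid.
SELECT * FROM events WHERE event_date = '2024-01-01';

-- Reads only the two column files the query actually needs.
SELECT user_id, event_type
FROM events
WHERE event_date = '2024-01-01';
```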

2 — Granules: The Fundamental Unit

  • GR-1 (MUST): Understand granules are 8,192 rows by default (configurable)
  • GR-2 (MUST): Know that granules are indivisible — if one row matches, all 8,192 rows are read
  • GR-3 (SHOULD): Design primary keys to maximize granule skipping

Key insight: The sparse primary index stores the key values of the first row of each granule (every 8,192nd row by default). ClickHouse binary-searches this index to identify matching granules, then reads only those granules.
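Granule size is a table-level setting. A sketch with a hypothetical `events` table (8192 is already the default and is spelled out only for illustration):

```sql
CREATE TABLE events
(
    event_date Date,
    user_id    UInt64,
    event_type LowCardinality(String)
)
ENGINE = MergeTree
ORDER BY (event_date, user_id)
SETTINGS index_granularity = 8192;  -- rows per granule (the default)
```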

3 — Primary Key Design (ORDER BY)

This is the single most important optimization — proper ORDER BY selection can improve performance by 100× or more.

  • PK-1 (MUST): Include columns that appear most frequently in WHERE clauses
  • PK-2 (MUST): Order columns by cardinality — lower cardinality first
  • PK-3 (MUST): Include time components for time-series workloads
  • PK-4 (SHOULD): Avoid high-cardinality columns (UUIDs) as first column

Cardinality Ordering Rationale

| Position | Search algorithm | Cardinality impact |
| --- | --- | --- |
| First column | Binary search, O(log₂ n) | Works efficiently regardless of cardinality |
| Secondary columns | Generic exclusion search | Effectiveness depends on predecessor cardinality |

  • Bad: `ORDER BY (user_uuid, timestamp)`. A UUID as the first column randomizes secondary column values across granules.
  • Good: `ORDER BY (tenant_id, timestamp, user_uuid)`. Low cardinality first enables effective pruning.

Compound Key Query Support

A compound key of `(user_id, timestamp, event_type)` efficiently supports the following patterns (see the sketch after this list):

  • WHERE user_id = X — highly efficient (uses first column)
  • WHERE user_id = X AND timestamp > Y — efficient (uses first two columns)
  • WHERE timestamp > Y — less efficient (skips first column)
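A sketch of these three patterns against a hypothetical `user_events` table:

```sql
CREATE TABLE user_events
(
    user_id    UInt64,
    timestamp  DateTime,
    event_type LowCardinality(String)
)
ENGINE = MergeTree
ORDER BY (user_id, timestamp, event_type);

-- Highly efficient: filters on a prefix of the key.
SELECT count() FROM user_events WHERE user_id = 42;
SELECT count() FROM user_events
WHERE user_id = 42 AND timestamp > now() - INTERVAL 1 DAY;

-- Less efficient: skips the first key column, so far fewer granules are pruned.
SELECT count() FROM user_events WHERE timestamp > now() - INTERVAL 1 DAY;
```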

4 — Query Optimization Techniques

PREWHERE Clause

  • PRE-1 (SHOULD): Use PREWHERE for highly selective filter conditions
  • PRE-2 (SHOULD): Use PREWHERE when filter columns are smaller than SELECT columns

PREWHERE reads only filter condition columns first, evaluates the condition, then reads remaining columns only for matching rows.
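A sketch, assuming a hypothetical `events` table where `status` is a small column and `payload` a wide one:

```sql
-- ClickHouse reads only `status` first; `user_id` and `payload` are
-- fetched just for the rows that pass the PREWHERE condition.
SELECT user_id, payload
FROM events
PREWHERE status = 'error'
WHERE event_date >= '2024-01-01';
```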

Approximate Functions

  • APX-1 (SHOULD): Use uniq() instead of COUNT(DISTINCT) when approximate results are acceptable
  • APX-2 (CAN): Use HyperLogLog-based functions for 10-100× performance improvement

Approximate functions use ~12 KB fixed memory vs O(n) for exact counts.
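For example (table and column names assumed):

```sql
-- Exact: memory grows with the number of distinct user_ids.
SELECT count(DISTINCT user_id) FROM events;

-- Approximate: fixed memory, small bounded error.
SELECT uniq(user_id) FROM events;
```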

Hashing Long Strings

  • HASH-1 (SHOULD): When grouping by long strings, hash to fixed-size integers (5-10× faster)
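A sketch using ClickHouse's built-in cityHash64(), against an assumed `page_views` table; `any(url)` recovers one representative string per group:

```sql
SELECT
    cityHash64(url) AS url_hash,  -- fixed-size 64-bit grouping key
    any(url)        AS url,       -- one original string per group
    count()         AS hits
FROM page_views
GROUP BY url_hash;
```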

5 — Data Type Optimization

  • DT-1 (MUST): Avoid Nullable columns when possible (requires additional UInt8 null-mask column)
  • DT-2 (MUST): Use LowCardinality(String) for columns with <10,000 unique values
  • DT-3 (SHOULD): Choose smallest appropriate types (UInt8 vs Int64 saves 7 bytes per row)

LowCardinality Benefits

  • Dictionary encoding applied automatically
  • Significant compression and query performance gains
  • Ideal for status fields, country codes, category names
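A sketch combining these rules in one schema (table and columns are hypothetical):

```sql
CREATE TABLE orders
(
    order_id UInt64,
    country  LowCardinality(String),  -- few distinct values: dictionary-encoded
    status   LowCardinality(String),
    quantity UInt16,                  -- smallest integer type that fits the range
    note     String DEFAULT ''        -- empty default instead of Nullable(String)
)
ENGINE = MergeTree
ORDER BY (country, order_id);
```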

6 — Denormalization

  • DN-1 (SHOULD): Prefer denormalized data over JOINs
  • DN-2 (SHOULD): Accept increased storage for eliminated JOIN overhead

Key insight: ClickHouse performs best with denormalized data. Columnar compression mitigates storage increase while eliminating hash table construction overhead.
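A sketch of the pattern, assuming user attributes are copied into the events table at insert time:

```sql
CREATE TABLE events_denorm
(
    event_time   DateTime,
    user_id      UInt64,
    user_country LowCardinality(String),  -- copied from the users table
    user_plan    LowCardinality(String),  -- copied from the users table
    event_type   LowCardinality(String)
)
ENGINE = MergeTree
ORDER BY (user_country, event_time);

-- No JOIN, so no hash table to build at query time.
SELECT user_plan, count() FROM events_denorm GROUP BY user_plan;
```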

7 — Advanced Optimizations

Materialized Views

  • MV-1 (SHOULD): Use materialized views for pre-computed aggregates
  • MV-2 (CAN): Leverage incremental updates at insert time rather than query time
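A sketch of an insert-time aggregate, assuming a source `events` table with `event_time` and `page` columns:

```sql
CREATE TABLE daily_hits
(
    event_date Date,
    page       String,
    hits       UInt64
)
ENGINE = SummingMergeTree  -- background merges sum the `hits` column
ORDER BY (event_date, page);

-- Runs incrementally on every insert into `events`, not at query time.
CREATE MATERIALIZED VIEW daily_hits_mv TO daily_hits AS
SELECT
    toDate(event_time) AS event_date,
    page,
    count() AS hits
FROM events
GROUP BY event_date, page;
```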

Projections

  • PROJ-1 (CAN): Maintain multiple physical sort orders for different query patterns
  • PROJ-2 (CAN): Use projections for pre-computed aggregates
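A sketch, assuming an `events` table whose primary key does not start with `user_id`:

```sql
-- A second physical sort order stored inside the table; the optimizer
-- uses it automatically for queries that filter by user_id.
ALTER TABLE events
    ADD PROJECTION by_user (SELECT * ORDER BY user_id);

-- Build the projection for data that already exists.
ALTER TABLE events MATERIALIZE PROJECTION by_user;
```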

Skip Indexes

  • SKIP-1 (SHOULD): Use skip indexes as last resort after primary key and projection optimization
  • SKIP-2 (CAN): Consider bloom filters for high-cardinality column filtering
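A bloom filter sketch for a high-cardinality `url` column (index name and parameters are illustrative):

```sql
-- 0.01 = target false-positive rate; GRANULARITY 4 = one filter per 4 granules.
ALTER TABLE events
    ADD INDEX url_bf url TYPE bloom_filter(0.01) GRANULARITY 4;

ALTER TABLE events MATERIALIZE INDEX url_bf;  -- backfill existing parts
```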

Dictionaries

  • DICT-1 (CAN): Use in-memory dictionaries for O(1) lookup times
  • DICT-2 (CAN): Replace JOINs with dictionary lookups where applicable
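A sketch, assuming a `countries` lookup table with `country_code` and `name` columns:

```sql
CREATE DICTIONARY country_names
(
    country_code String,
    name         String
)
PRIMARY KEY country_code
SOURCE(CLICKHOUSE(TABLE 'countries'))
LAYOUT(COMPLEX_KEY_HASHED())  -- string keys require a complex-key layout
LIFETIME(MIN 300 MAX 600);    -- refresh every 5-10 minutes

-- O(1) in-memory lookup instead of a JOIN against `countries`.
SELECT
    dictGet('country_names', 'name', tuple(country_code)) AS country,
    count()
FROM events
GROUP BY country;
```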

8 — Monitoring & Debugging

  • MON-1 (MUST): Use system.query_log for query performance analysis
  • MON-2 (MUST): Use EXPLAIN indexes = 1 to verify granule selection
  • MON-3 (SHOULD): Monitor granule ratio (granules selected / total granules)
  • MON-4 (SHOULD): Measure, change one variable, measure again

Key Metrics to Monitor

| Metric | Source | What to look for |
| --- | --- | --- |
| Rows read | system.query_log | Should be close to the result set size |
| Granules selected | EXPLAIN indexes = 1 | Lower ratio = better pruning |
| Memory usage | system.query_log | Watch for excessive allocation |
| Execution time | system.query_log | Track query patterns |
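A sketch of both checks (table and filter values are illustrative):

```sql
-- Recent queries ranked by duration, with rows read vs. rows returned.
SELECT
    query_duration_ms,
    read_rows,
    result_rows,
    formatReadableSize(memory_usage) AS peak_memory,
    query
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_time > now() - INTERVAL 1 HOUR
ORDER BY query_duration_ms DESC
LIMIT 10;

-- Verify granule pruning for a specific query.
EXPLAIN indexes = 1
SELECT count() FROM events WHERE user_id = 42;
```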

9 — Query Execution

  • EX-1 (SHOULD): Understand that ClickHouse uses all CPU cores by default
  • EX-2 (CAN): Rely on each thread processing an independent data range in parallel
  • EX-3 (CAN): Use multi-pass grouping (external aggregation) for large GROUP BY operations; see the settings sketch after the pipeline below

Execution Pipeline

  1. Parsing and Analysis — Create execution plan
  2. Optimization — Prune unnecessary data via sparse index
  3. Pipeline Execution — Parallel processing across CPU cores
  4. Final Processing — Merge results
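A sketch of per-query execution settings (both are standard ClickHouse settings; the threshold value is illustrative):

```sql
SELECT event_type, count()
FROM events
GROUP BY event_type
SETTINGS
    max_threads = 8,  -- cap parallelism instead of using every core
    max_bytes_before_external_group_by = 10000000000;  -- spill GROUP BY to disk past ~10 GB
```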

Performance Comparison Example

Without Optimization (wrong ORDER BY)

```sql
-- ORDER BY (postcode, address), querying WHERE town = 'LONDON'
-- Rows processed: 27.64 million (full table scan)
-- Time: 44ms
-- Data read: 44.21 MB
```

With Optimization (correct ORDER BY)

```sql
-- ORDER BY (town, price)
-- Rows processed: 81,920 (selected granules only)
-- Time: 5ms (8.8× faster)
-- Data read: 13.03 MB (3.4× reduction)
```

Review Checklist

When reviewing ClickHouse schemas and queries:

  1. Column Selection: Are only necessary columns selected?
  2. ORDER BY Design: Is cardinality ordering correct (low to high)?
  3. Time Component: Does ORDER BY include timestamp for time-series data?
  4. Data Types: Are Nullable columns avoided? Is LowCardinality used appropriately?
  5. PREWHERE: Would PREWHERE benefit selective filters?
  6. Granule Efficiency: Does EXPLAIN show good granule skip ratio?
  7. Denormalization: Are JOINs avoided where possible?
  8. Approximate Functions: Could approximate functions replace exact counts?

References

When writing queries, or when you need to know more about the tables, consult:

  • references/clickhouse_tables.md — Table schemas and query patterns
  • references/databases.md — Connection details and cluster architecture
  • references/troubleshooting.md — DDL queue, replication, and operational issues