
ClickHouse Cheat Sheet: Database Commands, Schema Design & Performance (2026)

By Rachel | Updated: May 12, 2026
TL;DR
  • ClickHouse is a column-oriented OLAP database — typically 10–100× faster than MySQL/PostgreSQL for analytics on billions of rows
  • Primary Key (not unique) determines sort order — choose carefully for query patterns
  • SELECT is the main operation — INSERT is append-only; UPDATE/DELETE are mutations (expensive)
  • Array and tuple functions are first-class citizens — no need for JSON extraction
  • SAMPLE, PREWHERE, and FINAL modifiers are ClickHouse-specific and critical for performance
  • Materialized views are the ETL layer — they pre-aggregate data on INSERT
  • Can replace MongoDB + Redis + Druid for many analytics workloads

Quick reference tables

Installation

Platform           Command
macOS              brew install clickhouse
Ubuntu / Debian    sudo apt install clickhouse-server clickhouse-client
Docker             docker run -d --network host clickhouse/clickhouse-server
Binary             curl https://clickhouse.com/ | sh
Verify             clickhouse client --version

Core CLI

Command                                           What it does
clickhouse client                                 Interactive SQL shell
clickhouse client --host localhost --port 9000    Connect to a specific host
clickhouse client --query "SELECT 1"              Run a single query
clickhouse local                                  Run queries on local files without a server
clickhouse-server                                 Start the server daemon
clickhouse benchmark                              Run benchmarking queries
clickhouse obfuscator                             Obfuscate data for anonymization

Data types

Type              Example                      Notes
UInt8/16/32/64    UInt64                       Unsigned integers
Int8/16/32/64     Int32                        Signed integers
Float32/64        Float64                      IEEE 754 floats
Decimal           Decimal(18,2)                Precise decimals
String            String                       Arbitrary bytes
FixedString(N)    FixedString(16)              Fixed-length bytes
Date              Date                         Calendar date (no time)
DateTime          DateTime                     Unix timestamp (second precision)
DateTime64        DateTime64(3)                Sub-second precision
Enum8/16          Enum8('red'=1,'blue'=2)      Named values
Array(T)          Array(String)                Arrays of type T
Tuple(A, B)       Tuple(Int32, String)         Fixed-arity, heterogeneous
Map(K, V)         Map(String, Int64)           Key-value pairs
JSON              JSON                         Dynamic JSON storage
IPv4/IPv6         IPv4                         Network addresses
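
To check how ClickHouse types a literal, toTypeName() is handy; a quick illustrative query:

sql
-- Inspect the inferred type of a few literals (illustrative)
SELECT
    toTypeName([1, 2, 3])              AS arr_type,    -- Array(UInt8)
    toTypeName((1, 'a'))               AS tuple_type,  -- Tuple(UInt8, String)
    toTypeName(map('k', 1))            AS map_type,    -- Map(String, UInt8)
    toTypeName(toDateTime64(now(), 3)) AS dt64_type;   -- DateTime64(3)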

Schema design

Creating tables

sql
CREATE TABLE events (
    event_id   UUID        DEFAULT generateUUIDv4(),
    user_id    UUID        NOT NULL,
    event_type String      NOT NULL,
    properties Map(String, String),           -- defaults to an empty map
    revenue    Decimal(18, 2) DEFAULT 0,
    ip_address IPv4,
    created_at DateTime DEFAULT now()
) ENGINE = MergeTree
  ORDER BY (event_type, cityHash64(user_id), created_at)
  PARTITION BY toYYYYMM(created_at)
  SAMPLE BY cityHash64(user_id)   -- sample key: unsigned-integer expression that is part of ORDER BY
  SETTINGS index_granularity = 8192;

Table engines

Engine                  Use for
MergeTree               General-purpose — sorted data with a sparse primary index
ReplacingMergeTree      Deduplication by sorting key (during merges)
SummingMergeTree        Sums numeric columns for rows with the same sorting key
AggregatingMergeTree    Pre-aggregation with -State/-Merge functions
CollapsingMergeTree     Collapses paired rows with opposite Sign values
GraphiteMergeTree       Time-series rollup of Graphite data
Memory                  In-memory (testing, small tables)
Distributed             Sharding across multiple nodes
Buffer                  Write-buffered landing before MergeTree
ReplicatedMergeTree     Replication/HA, coordinated via ClickHouse Keeper or ZooKeeper
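
As a concrete example, a minimal ReplacingMergeTree sketch (table and column names are illustrative): rows sharing the same ORDER BY key are collapsed during background merges, keeping the row with the highest version.

sql
CREATE TABLE user_profiles (
    user_id    UUID,
    email      String,
    version    UInt64,
    updated_at DateTime DEFAULT now()
) ENGINE = ReplacingMergeTree(version)
  ORDER BY user_id;

-- FINAL forces deduplication at query time (slower, but always current)
SELECT * FROM user_profiles FINAL;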

ORDER BY — the most important choice

The ORDER BY key determines:

  1. Sort order on disk (and the deduplication key for ReplacingMergeTree)
  2. The sparse primary index (one entry per index_granularity rows)
  3. Query speed for WHERE filters on the leading columns
sql
-- High-cardinality events: filterable by event_type, user_id, date
ORDER BY (event_type, user_id, created_at)

-- User-centric: fast for queries per user
ORDER BY (user_id, created_at)

-- Time-series: fast for date-range + metric queries
ORDER BY (created_at, metric_name)

SELECT patterns

Basic SELECT

sql
SELECT
    user_id,
    event_type,
    count()        AS event_count,
    sum(revenue)   AS total_revenue,
    avg(revenue)   AS avg_revenue
FROM events
WHERE
    created_at >= '2026-01-01'
    AND event_type IN ('purchase', 'refund')
GROUP BY user_id, event_type
ORDER BY total_revenue DESC
LIMIT 100;

ARRAY functions (ClickHouse superpower)

sql
-- Create arrays
SELECT [1, 2, 3] AS nums, ['a', 'b'] AS letters;

-- Array aggregation
SELECT groupArray(10)(user_id)   -- first 10 user_ids per group
SELECT groupArrayIf(event_type, revenue > 0)  -- conditional array

-- Array manipulation
SELECT arrayJoin([1,2,3]) AS item;          -- unnest: one row per element
SELECT arrayFilter(x -> x > 2, [1,2,3,4]); -- filter: [3,4]
SELECT arrayMap(x -> x * 2, [1,2,3]);      -- map: [2,4,6]
SELECT arraySort(x -> -x, [3,1,2]);         -- sort descending: [3,2,1]
SELECT arrayReduce('sum', [1,2,3]);         -- reduce: 6
SELECT has(['a','b','c'], 'b');             -- has: 1
SELECT arrayDifference([1,2,4]);             -- diff: [0,1,2]
SELECT arrayCumSum([1,2,3]);               -- cumsum: [1,3,6]

Window functions (ClickHouse 22.3+)

sql
SELECT
    user_id,
    created_at,
    revenue,
    sum(revenue) OVER (
        PARTITION BY user_id
        ORDER BY created_at
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ) AS rolling_3_revenue,
    revenue / sum(revenue) OVER (
        PARTITION BY toYYYYMM(created_at)
    ) * 100 AS pct_of_month
FROM events;

Advanced SELECT

sql
-- FINAL for ReplacingMergeTree (get the latest row per key)
SELECT * FROM events FINAL WHERE event_type = 'login';

-- PREWHERE filters on a cheap column before reading the remaining columns
SELECT user_id, event_type FROM events PREWHERE event_type = 'purchase';

-- SAMPLE for a representative subset (requires SAMPLE BY in the table definition)
SELECT count() FROM events SAMPLE 0.01;  -- 1% sample

-- ASOF JOIN for time-series alignment (nearest earlier bid per trade)
SELECT a.timestamp, a.price, b.bid
FROM trades AS a ASOF LEFT JOIN bids AS b ON a.stock = b.stock AND a.timestamp >= b.timestamp;

Aggregations

Standard aggregations

sql
SELECT
    count()         AS total_rows,
    countIf(event_type = 'purchase') AS purchases,
    uniqExact(user_id) AS unique_users,
    sum(revenue)    AS total_revenue,
    avg(revenue)    AS avg_revenue,
    min(created_at) AS first_event,
    max(created_at) AS last_event,
    stddevSamp(revenue) AS std_dev_revenue
FROM events;

-State / -Merge pattern (for AggregatingMergeTree)

sql
-- Step 1: Insert into aggregating table
INSERT INTO events_agg
SELECT
    event_type,
    toStartOfDay(created_at) AS day,
    sumState(revenue)         AS revenue_agg
FROM events
GROUP BY event_type, day;

-- Step 2: Read aggregated result
SELECT event_type, day, sumMerge(revenue_agg) AS total_revenue
FROM events_agg
GROUP BY event_type, day;
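
For the above to round-trip, the target table is assumed to store the aggregation state, roughly like this (a sketch; names mirror the example):

sql
CREATE TABLE events_agg (
    event_type  String,
    day         DateTime,
    revenue_agg AggregateFunction(sum, Decimal(18, 2))
) ENGINE = AggregatingMergeTree()
  ORDER BY (event_type, day);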

Parametric aggregations

sql
-- topK(N): top N most frequent values
SELECT topK(10)(user_id) FROM events;

-- histogram(N): compute N-bin histogram
SELECT histogram(10)(revenue) FROM events;

-- retention: did users seen on day 1 come back on day 2?
SELECT
    sum(r[1]) AS day1_users,
    sum(r[2]) AS day2_users
FROM (
    SELECT
        user_id,
        retention(toDate(created_at) = '2026-01-01',
                  toDate(created_at) = '2026-01-02') AS r
    FROM events
    WHERE event_type = 'signup'
    GROUP BY user_id
);

Insert, update, delete

Insert

sql
INSERT INTO events (user_id, event_type, revenue) VALUES
    ('00000000-0000-0000-0000-000000000001', 'purchase', 49.99),
    ('00000000-0000-0000-0000-000000000002', 'refund', -49.99);

INSERT INTO events
SELECT * FROM remote_events WHERE created_at > '2026-01-01';

Mutation (UPDATE/DELETE — expensive, run async)

sql
-- Delete: an async mutation that rewrites the affected data parts
ALTER TABLE events DELETE WHERE created_at < '2025-01-01';

-- Update: also an async mutation
ALTER TABLE events UPDATE revenue = revenue * 0.9
WHERE event_type = 'refund' AND created_at < '2026-01-01';

-- Check mutation status
SELECT * FROM system.mutations WHERE table = 'events';

Materialized views

Real-time aggregation

sql
-- Base table
CREATE TABLE events_raw (
    event_type String,
    user_id    String,
    revenue    Decimal(18,2),
    created_at DateTime
) ENGINE = MergeTree() ORDER BY (event_type, created_at);

-- Materialized view for daily metrics
CREATE MATERIALIZED VIEW events_daily
ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(day)
ORDER BY (event_type, day)
AS SELECT
    event_type,
    toStartOfDay(created_at) AS day,
    sum(revenue)    AS daily_revenue,
    count()         AS daily_count        -- distinct counts need uniqState + AggregatingMergeTree (see below)
FROM events_raw
GROUP BY event_type, toStartOfDay(created_at);

-- Now INSERT into base table — view auto-populates
INSERT INTO events_raw VALUES ('purchase', 'user-1', 99.99, now());
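
SummingMergeTree can only add up plain numeric columns, so unique counts belong in a separate view using the -State pattern. A hedged sketch (view and column names are illustrative):

sql
CREATE MATERIALIZED VIEW events_daily_users
ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMM(day)
ORDER BY (event_type, day)
AS SELECT
    event_type,
    toStartOfDay(created_at) AS day,
    uniqState(user_id)       AS users_state
FROM events_raw
GROUP BY event_type, day;

-- Read with the matching -Merge combinator
SELECT event_type, day, uniqMerge(users_state) AS daily_users
FROM events_daily_users
GROUP BY event_type, day;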

Performance tuning

EXPLAIN

sql
EXPLAIN indexes = 1, actions = 1
SELECT user_id, count() FROM events WHERE event_type = 'purchase' GROUP BY user_id;

Settings that matter

sql
-- Per-query settings
SET max_rows_to_read = 10000000;
SET max_bytes_to_read = 1073741824;
SET max_execution_time = 300;
SET max_memory_usage = 8589934592;  -- 8GB
SET use_uncompressed_cache = 1;
SET prefer_localhost_replica = 1;
SET max_block_size = 65536;
SET max_threads = 16;
SET force_index_by_date = 0;  -- set to 1 to reject queries that cannot restrict the date/partition key
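
The same settings can also be scoped to a single query with a trailing SETTINGS clause instead of a session-wide SET (values below are illustrative):

sql
SELECT user_id, count() AS events
FROM events
GROUP BY user_id
SETTINGS max_threads = 8, max_memory_usage = 4294967296;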

Index strategy

ClickHouse’s primary index stores one value per index_granularity rows (default 8192):

sql
-- Good: event_type is the leading ORDER BY column → index prunes granules
WHERE event_type = 'purchase'

-- Slow: negation can't use the sparse primary index → near-full scan
WHERE event_type != 'purchase'

-- Good: range on the partition key column → partition pruning
WHERE created_at BETWEEN '2026-01-01' AND '2026-03-31'

-- Slow: revenue is not in ORDER BY and has no skip index → full scan
WHERE revenue > 100 AND user_id NOT IN (...)
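
For filters on columns outside the ORDER BY key (like revenue above), a data-skipping index can help; a minimal sketch, with an illustrative index name and granularity:

sql
-- minmax skip index on revenue: stores min/max per block of granules
ALTER TABLE events ADD INDEX revenue_minmax revenue TYPE minmax GRANULARITY 4;

-- Build the index for existing parts (new inserts are indexed automatically)
ALTER TABLE events MATERIALIZE INDEX revenue_minmax;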

Distributed queries

sql
-- Create a distributed table over an existing local table (sharding layer)
CREATE TABLE events_distributed AS events
ENGINE = Distributed(replicas_cluster, default, events, rand());

-- Query across all shards
SELECT event_type, count() FROM events_distributed GROUP BY event_type;

-- Use GLOBAL IN for distributed subqueries (subquery runs once, result is sent to all shards)
SELECT * FROM events_distributed WHERE user_id GLOBAL IN (
    SELECT user_id FROM users WHERE country = 'IN'
);

Summary

  • ORDER BY key = primary key + sort order + index — choose for your query patterns
  • INSERT is fast; UPDATE/DELETE are async mutations (expensive)
  • ARRAY functions are first-class — use arrayJoin, arrayMap, arrayFilter before falling back to JSON
  • Materialized view + SummingMergeTree = real-time pre-aggregation on INSERT
  • EXPLAIN indexes = 1 shows which index is used — always check before tuning
  • -State/-Merge pattern for incremental aggregation across large datasets
  • Sampling (SAMPLE 0.01) and PREWHERE are ClickHouse-specific tools for speed

FAQ

How does ClickHouse compare to PostgreSQL for analytics? ClickHouse is 10–100× faster on analytical queries over billions of rows because it’s column-oriented (only reads needed columns) and uses vectorized execution. PostgreSQL is better for transactional workloads, joins, and complex relationships. Most teams use both — PostgreSQL for writes, ClickHouse for reads.

What’s the difference between PRIMARY KEY and ORDER BY? By default the PRIMARY KEY is the same as the ORDER BY key; you can declare a separate PRIMARY KEY, but it must be a prefix of ORDER BY. Either way it defines the sparse index and sort order, not uniqueness. This is different from PostgreSQL, where PRIMARY KEY enforces uniqueness: ClickHouse’s sorting key can contain duplicate values.
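
For illustration, a table where PRIMARY KEY is declared as a prefix of ORDER BY (hypothetical names):

sql
CREATE TABLE example (
    a UInt64,
    b String,
    c DateTime
) ENGINE = MergeTree
  ORDER BY (a, b, c)    -- full on-disk sort order
  PRIMARY KEY (a);      -- prefix of ORDER BY, kept in memory as the sparse index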

Why is my DELETE query taking a long time? ALTER TABLE ... DELETE in ClickHouse is a mutation that marks rows for deletion and rewrites the affected parts asynchronously. For large batches, use TTL-based cleanup or partition drops (ALTER TABLE ... DROP PARTITION). For row-level deduplication or soft deletes, use ReplacingMergeTree with a version column.
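
Both cleanup patterns as a quick sketch (the TTL interval and partition value are illustrative):

sql
-- Expire rows automatically one year after created_at
ALTER TABLE events MODIFY TTL created_at + INTERVAL 1 YEAR;

-- Drop an entire month as a cheap metadata operation (table partitioned by toYYYYMM)
ALTER TABLE events DROP PARTITION 202501;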

Can ClickHouse handle updates? Yes, via ALTER TABLE UPDATE but it’s asynchronous and expensive. Design schemas to avoid needing frequent updates — append-only works best. For real-time mutable state, use a separate Redis or in-memory table for hot data and flush to ClickHouse periodically.

What’s the max data size for a single ClickHouse node? Practically unlimited — ClickHouse handles hundreds of TB per node. Sharding across multiple nodes (Distributed engine) scales linearly. For most use cases, a single well-configured node handles tens of billions of rows comfortably.