MeshWorld India Logo MeshWorld.
Cheatsheet ClickHouse Database Analytics SQL DevOps Data Engineering 9 min read

ClickHouse Cheat Sheet: Database Commands, Schema Design & Performance (2026)

Rachel
By Rachel
| Updated: May 12, 2026
ClickHouse Cheat Sheet: Database Commands, Schema Design & Performance (2026)
TL;DR
  • ClickHouse is a column-oriented OLAP database — 100× faster than MySQL/PostgreSQL for analytics on billions of rows
  • Primary Key (not unique) determines sort order — choose carefully for query patterns
  • SELECT is the main operation — INSERT is append-only; UPDATE/DELETE are mutations (expensive)
  • Array and tuple functions are first-class citizens — no need for JSON extraction
  • Sampling, prewhere, and final modifiers are ClickHouse-specific and critical for performance
  • Materialized views are the ETL layer — they pre-aggregate data on INSERT
  • Replaces MongoDB + Redis + Druid for most analytics workloads

Quick reference tables

Installation

| Platform | Command | |---|---| | macOS | brew install clickhouse | | Ubuntu / Debian | sudo apt install clickhouse-server clickhouse-client | | Docker | docker run -d --network host clickhouse/clickhouse-server | | Binary | curl https://clickhouse.com/ | sh | | Verify | clickhouse client --version |

Core CLI

| Command | What it does | |---|---| | clickhouse client | Interactive SQL shell | | clickhouse client --host localhost --port 9000 | Connect to specific host | | clickhouse client --query "SELECT 1" | Run single query | | clickhouse local | Run queries on local files without a server | | clickhouse-server | Start server daemon | | clickhouse benchmark | Run benchmarking queries | | clickhouse obfuscator | Obfuscate data for anonymization |

Data types

| Type | Example | Notes | |---|---|---| | UInt8/16/32/64 | UInt64 | Unsigned integers | | Int8/16/32/64 | Int32 | Signed integers | | Float32/64 | Float64 | IEEE 754 | | Decimal | Decimal(18,2) | Precise decimals | | String | String | Arbitrary bytes | | FixedString(N) | FixedString(16) | Fixed-length bytes | | Date | Date | Calendar date (no time) | | DateTime | DateTime | Unix timestamp | | DateTime64 | DateTime64(3) | Sub-second precision | | Enum8/16 | Enum8('red'=1,'blue'=2) | Named values | | Array(T) | Array(String) | Arrays of type T | | Tuple(A, B) | Tuple(Int32, String) | Fixed-arity heterogeneous | | Map(K, V) | Map(String, Int64) | Key-value pairs | | JSON | JSON | Dynamic JSON storage | | IPv4/6 | IPv4 | Network addresses |


Schema design

Creating tables

sql
CREATE TABLE events (
    event_id   UUID        DEFAULT generateUUIDv4(),
    user_id    UUID        NOT NULL,
    event_type String      NOT NULL,
    properties Map(String, String) DEFAULT {}(),
    revenue    Decimal(18, 2) DEFAULT 0,
    ip_address IPv4,
    created_at DateTime DEFAULT now()
) ENGINE = MergeTree
  ORDER BY (event_type, user_id, created_at)
  PARTITION BY toYYYYMM(created_at)
  SAMPLE BY user_id
  SETTINGS index_granularity = 8192;

Table engines

| Engine | Use for | |---|---| | MergeTree (default) | General-purpose — sorted data with index | | ReplacingMergeTree | Deduplication by primary key | | SummingMergeTree | Auto-sum columns on deduplication | | AggregatingMergeTree | Pre-aggregate with -State/-Merge functions | | CollapsingMergeTree | Delta-cancellation (bi-directional events) | | SummingSortedMergeTree | Time-series with auto-rollup | | Memory | In-memory (testing, small tables) | | Distributed | Sharding across multiple nodes | | Buffer | Write-buffered landing before MergeTree | | Keeper / ReplicatedMergeTree | HA with ZooKeeper alternative |

ORDER BY — the most important choice

The ORDER BY key determines:

  1. Sort order on disk (must be unique-ish for ReplacingMergeTree)
  2. Primary index granularity (every Nth row in index_granularity)
  3. Query speed for WHERE filters on leading columns
sql
-- High-cardinality events: filterable by event_type, user_id, date
ORDER BY (event_type, user_id, created_at)

-- User-centric: fast for queries per user
ORDER BY (user_id, created_at)

-- Time-series: fast for date-range + metric queries
ORDER BY (created_at, metric_name)

SELECT patterns

Basic SELECT

sql
SELECT
    user_id,
    event_type,
    count()        AS event_count,
    uniqExact(user_id) AS unique_users,
    sum(revenue)   AS total_revenue,
    avg(revenue)   AS avg_revenue
FROM events
WHERE
    created_at >= '2026-01-01'
    AND event_type IN ('purchase', 'refund')
GROUP BY user_id, event_type
ORDER BY total_revenue DESC
LIMIT 100;

ARRAY functions (ClickHouse superpower)

sql
-- Create arrays
SELECT [1, 2, 3] AS nums, ['a', 'b'] AS letters;

-- Array aggregation
SELECT groupArray(10)(user_id)   -- top 10 user_ids per group
SELECT groupArrayIf(event_type, revenue > 0)  -- conditional array

-- Array manipulation
SELECT arrayJoin([1,2,3]) AS item;          -- unnest: one row per element
SELECT arrayFilter(x -> x > 2, [1,2,3,4]); -- filter: [3,4]
SELECT arrayMap(x -> x * 2, [1,2,3]);      -- map: [2,4,6]
SELECT arraySort(arr -> -arr, [3,1,2]);     -- sort: [3,2,1]
SELECT arrayReduce('sum', [1,2,3]);         -- reduce: 6
SELECT has(['a','b','c'], 'b');             -- has: 1
SELECT arrayDifference([1,2,4]);             -- diff: [0,1,2]
SELECT arrayCumSum([1,2,3]);               -- cumsum: [1,3,6]

Window functions (ClickHouse 22.3+)

sql
SELECT
    user_id,
    created_at,
    revenue,
    sum(revenue) OVER (
        PARTITION BY user_id
        ORDER BY created_at
        ROWS BETWEEN 3 PRECEDING AND CURRENT ROW
    ) AS rolling_3_revenue,
    revenue / sum(revenue) OVER (
        PARTITION BY toYYYYMM(created_at)
    ) * 100 AS pct_of_month
FROM events;

Advanced SELECT

sql
-- USING FINAL for ReplacingMergeTree (get latest row per key)
SELECT * FROM events FINAL WHERE event_type = 'login';

-- USING PREWHERE for index optimization
SELECT user_id, event_type FROM events PREWHERE event_type = 'purchase';

-- SAMPLE for representative subset (requires SAMPLE BY in table)
SELECT count() FROM events SAMPLE 0.01;  -- 1% sample

-- ASOF JOIN for time-series alignment
SELECT a.timestamp, a.price, b.bid
FROM trades ASOF LEFT JOIN bids AS b ON a.stock = b.stock AND a.timestamp >= b.timestamp;

Aggregations

Standard aggregations

sql
SELECT
    count()         AS total_rows,
    countIf(event_type = 'purchase') AS purchases,
    uniqExact(user_id) AS unique_users,
    sum(revenue)    AS total_revenue,
    avg(revenue)    AS avg_revenue,
    min(created_at) AS first_event,
    max(created_at) AS last_event,
    stddevSamp(revenue) AS std_dev_revenue
FROM events;

-State / -Merge pattern (for AggregatingMergeTree)

sql
-- Step 1: Insert into aggregating table
INSERT INTO events_agg
SELECT
    event_type,
    toStartOfDay(created_at) AS day,
    sumState(revenue)         AS revenue_agg
FROM events
GROUP BY event_type, day;

-- Step 2: Read aggregated result
SELECT event_type, day, sumMerge(revenue_agg) AS total_revenue
FROM events_agg
GROUP BY event_type, day;

Parametric aggregations

sql
-- topK(N): top N most frequent values
SELECT topK(10)(user_id) FROM events;

-- histogram(N): compute N-bin histogram
SELECT histogram(10)(revenue) FROM events;

-- retention: event funnel
SELECT
    retention(created_at = '2026-01-01')   AS day1,
    retention(created_at = '2026-01-01',
              created_at = '2026-01-02')   AS day2
FROM events
WHERE event_type = 'signup';

Insert, update, delete

Insert

sql
INSERT INTO events (user_id, event_type, revenue) VALUES
    ('uuid-1', 'purchase', 49.99),
    ('uuid-2', 'refund', -49.99);

INSERT INTO events
SELECT * FROM remote_events WHERE created_at > '2026-01-01';

Mutation (UPDATE/DELETE — expensive, run async)

sql
-- Delete: creates a lightweight mutation (async)
ALTER TABLE events DELETE WHERE created_at < '2025-01-01';

-- Update: 2026-05-14 also async
ALTER TABLE events UPDATE revenue = revenue * 0.9
WHERE event_type = 'refund' AND created_at < '2026-01-01';

-- Check mutation status
SELECT * FROM system.mutations WHERE table = 'events';

Materialized views

Real-time aggregation

sql
-- Base table
CREATE TABLE events_raw (
    event_type String,
    user_id    String,
    revenue    Decimal(18,2),
    created_at DateTime
) ENGINE = MergeTree() ORDER BY (event_type, created_at);

-- Materialized view for daily metrics
CREATE MATERIALIZED VIEW events_daily
ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(day)
ORDER BY (event_type, day)
AS SELECT
    event_type,
    toStartOfDay(created_at) AS day,
    sum(revenue)    AS daily_revenue,
    count()         AS daily_count,
    uniqExact(user_id) AS daily_users
FROM events_raw
GROUP BY event_type, toStartOfDay(created_at);

-- Now INSERT into base table — view auto-populates
INSERT INTO events_raw VALUES ('purchase', 'user-1', 99.99, now());

Performance tuning

EXPLAIN

sql
EXPLAIN indexes = 1, actions = 1
SELECT user_id, count() FROM events WHERE event_type = 'purchase' GROUP BY user_id;

Settings that matter

sql
-- Per-query settings
SET max_rows_to_read = 10000000;
SET max_bytes_to_read = 1073741824;
SET max_execution_time = 300;
SET max_memory_usage = 8589934592;  -- 8GB
SET use_uncompressed_cache = 1;
SET prefer_localhost_replica = 1;
SET max_block_size = 65536;
SET max_threads = 16;
SET force_index_by_date = 0;  -- allow range scans not matching date

Index strategy

ClickHouse’s primary index stores one value per index_granularity rows (default 8192):

sql
-- Good: event_type as first ORDER BY key → fast WHERE on event_type
WHERE event_type = 'purchase'

-- Slow: event_type NOT in ORDER BY key → full scan
WHERE event_type != 'purchase'

-- Good: range on date (first column in ORDER BY)
WHERE created_at BETWEEN '2026-01-01' AND '2026-03-31'

-- Slow: skip column not in ORDER BY
WHERE revenue > 100 AND user_id NOT IN (...)

Distributed queries

sql
-- Create a distributed table (sharding layer)
CREATE TABLE events_distributed
ENGINE = Distributed(replicas_cluster, default, events, rand())
AS SELECT * FROM events WHERE 0=1;

-- Query across all shards
SELECT event_type, count() FROM events_distributed GROUP BY event_type;

-- Use -Global for distributed subqueries
SELECT * FROM events WHERE user_id IN (
    SELECT user_id FROM users_global WHERE country = 'IN'
);

Summary

  • ORDER BY key = primary key + sort order + index — choose for your query patterns
  • INSERT is fast; UPDATE/DELETE are async mutations (expensive)
  • ARRAY functions are first-class — use arrayJoin, arrayMap, arrayFilter before falling back to JSON
  • Materialized view + SummingMergeTree = real-time pre-aggregation on INSERT
  • EXPLAIN indexes = 1 shows which index is used — always check before tuning
  • -State/-Merge pattern for incremental aggregation across large datasets
  • Sampling (SAMPLE 0.01) and PREWHERE are ClickHouse-specific tools for speed

FAQ

How does ClickHouse compare to PostgreSQL for analytics? ClickHouse is 10–100× faster on analytical queries over billions of rows because it’s column-oriented (only reads needed columns) and uses vectorized execution. PostgreSQL is better for transactional workloads, joins, and complex relationships. Most teams use both — PostgreSQL for writes, ClickHouse for reads.

What’s the difference between PRIMARY KEY and ORDER BY? ClickHouse doesn’t have a separate PRIMARY KEY — the ORDER BY column(s) define both sort order and index. This is different from PostgreSQL where PRIMARY KEY enforces uniqueness. ClickHouse’s ORDER BY can have duplicate values.

Why is my DELETE query taking a long time? DELETE in ClickHouse is a mutation that marks rows for deletion and removes them during merges. For large batches, use TTL-based cleanup or partition drops (ALTER TABLE DETACH PARTITION). For real-time deletes, use ReplacingMergeTree with a version column.

Can ClickHouse handle updates? Yes, via ALTER TABLE UPDATE but it’s asynchronous and expensive. Design schemas to avoid needing frequent updates — append-only works best. For real-time mutable state, use a separate Redis or in-memory table for hot data and flush to ClickHouse periodically.

What’s the max data size for a single ClickHouse node? Practically unlimited — ClickHouse handles hundreds of TB per node. Sharding across multiple nodes (Distributed engine) scales linearly. For most use cases, a single well-configured node handles tens of billions of rows comfortably.