- ClickHouse is a column-oriented OLAP database — typically 10–100× faster than MySQL/PostgreSQL for analytics on billions of rows (quick demo after this list)
- Primary Key (not unique) determines sort order — choose carefully for query patterns
- SELECT is the main operation — INSERT is append-only; UPDATE/DELETE are mutations (expensive)
- Array and tuple functions are first-class citizens — no need for JSON extraction
- Sampling, prewhere, and final modifiers are ClickHouse-specific and critical for performance
- Materialized views are the ETL layer — they pre-aggregate data on INSERT
- Can replace MongoDB + Redis + Druid stacks for most analytics workloads
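A quick feel for the speed claim: counting over a billion generated rows with the standard numbers() table function typically finishes in about a second on commodity hardware:
-- vectorized execution scans the generated rows in large blocks, not row by row
SELECT count() FROM numbers(1000000000) WHERE number % 2 = 0;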
Quick reference tables
Installation
| Platform | Command |
|---|---|
| macOS | brew install clickhouse |
| Ubuntu / Debian | sudo apt install clickhouse-server clickhouse-client |
| Docker | docker run -d --network host clickhouse/clickhouse-server |
| Binary | `curl https://clickhouse.com/ \| sh` |
| Verify | clickhouse client --version |
Core CLI
| Command | What it does |
|---|---|
| clickhouse client | Interactive SQL shell |
| clickhouse client --host localhost --port 9000 | Connect to specific host |
| clickhouse client --query "SELECT 1" | Run single query |
| clickhouse local | Run queries on local files without a server (example below) |
| clickhouse-server | Start server daemon |
| clickhouse benchmark | Run benchmarking queries |
| clickhouse obfuscator | Obfuscate data for anonymization |
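For example, clickhouse local can query a file in place with the file() table function; a minimal sketch (events.csv and its revenue column are hypothetical):
-- run inside `clickhouse local` (or pass via --query)
SELECT count(), avg(revenue)
FROM file('events.csv', CSVWithNames);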
Data types
| Type | Example | Notes |
|---|---|---|
| UInt8/16/32/64 | UInt64 | Unsigned integers |
| Int8/16/32/64 | Int32 | Signed integers |
| Float32/64 | Float64 | IEEE 754 |
| Decimal | Decimal(18,2) | Precise decimals |
| String | String | Arbitrary bytes |
| FixedString(N) | FixedString(16) | Fixed-length bytes |
| Date | Date | Calendar date (no time) |
| DateTime | DateTime | Unix timestamp |
| DateTime64 | DateTime64(3) | Sub-second precision |
| Enum8/16 | Enum8('red'=1,'blue'=2) | Named values |
| Array(T) | Array(String) | Arrays of type T |
| Tuple(A, B) | Tuple(Int32, String) | Fixed-arity heterogeneous |
| Map(K, V) | Map(String, Int64) | Key-value pairs |
| JSON | JSON | Dynamic JSON storage |
| IPv4/6 | IPv4 | Network addresses |
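A quick sketch of literals and conversion functions for several of these types (all standard ClickHouse functions; the values are examples):
SELECT
    [1, 2, 3] AS arr,                                 -- Array(UInt8)
    map('plan', 'pro') AS props,                      -- Map(String, String)
    toDateTime64('2026-01-01 12:00:00.123', 3) AS ts, -- millisecond precision
    toDecimal64(49.99, 2) AS price,                   -- Decimal(18, 2)
    toIPv4('192.168.0.1') AS ip;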
Schema design
Creating tables
CREATE TABLE events (
event_id UUID DEFAULT generateUUIDv4(),
user_id UUID NOT NULL,
event_type String NOT NULL,
properties Map(String, String) DEFAULT map(),
revenue Decimal(18, 2) DEFAULT 0,
ip_address IPv4,
created_at DateTime DEFAULT now()
) ENGINE = MergeTree
ORDER BY (event_type, intHash32(user_id), created_at)
PARTITION BY toYYYYMM(created_at)
SAMPLE BY intHash32(user_id) -- sampling key must be an unsigned-integer expression present in ORDER BY
SETTINGS index_granularity = 8192;
Table engines
| Engine | Use for |
|---|---|
| MergeTree (default) | General-purpose — sorted data with index |
| ReplacingMergeTree | Deduplication by primary key (sketch after this table) |
| SummingMergeTree | Auto-sum columns on deduplication |
| AggregatingMergeTree | Pre-aggregate with -State/-Merge functions |
| CollapsingMergeTree | Delta-cancellation (bi-directional events) |
| GraphiteMergeTree | Time-series with auto-rollup |
| Memory | In-memory (testing, small tables) |
| Distributed | Sharding across multiple nodes |
| Buffer | Write-buffered landing before MergeTree |
| ReplicatedMergeTree | Replication/HA, coordinated via ClickHouse Keeper (the ZooKeeper alternative) |
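The engines differ mainly in what happens when parts merge. A minimal ReplacingMergeTree sketch (table and column names are illustrative); the newest version per key survives:
CREATE TABLE user_state (
    user_id UUID,
    status String,
    version UInt64
) ENGINE = ReplacingMergeTree(version)
ORDER BY user_id;
-- FINAL collapses not-yet-merged duplicates to the latest version at query time
SELECT * FROM user_state FINAL;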
ORDER BY — the most important choice
The ORDER BY key determines:
- Sort order on disk (also the deduplication key for ReplacingMergeTree)
- Primary index entries (one per index_granularity rows)
- Query speed for WHERE filters on leading columns
-- High-cardinality events: filterable by event_type, user_id, date
ORDER BY (event_type, user_id, created_at)
-- User-centric: fast for queries per user
ORDER BY (user_id, created_at)
-- Time-series: fast for date-range + metric queries
ORDER BY (created_at, metric_name)
SELECT patterns
Basic SELECT
SELECT
user_id,
event_type,
count() AS event_count,
uniqExact(user_id) AS unique_users,
sum(revenue) AS total_revenue,
avg(revenue) AS avg_revenue
FROM events
WHERE
created_at >= '2026-01-01'
AND event_type IN ('purchase', 'refund')
GROUP BY user_id, event_type
ORDER BY total_revenue DESC
LIMIT 100;
ARRAY functions (ClickHouse superpower)
-- Create arrays
SELECT [1, 2, 3] AS nums, ['a', 'b'] AS letters;
-- Array aggregation
SELECT groupArray(10)(user_id) -- first 10 user_ids per group
SELECT groupArrayIf(event_type, revenue > 0) -- conditional array
-- Array manipulation
SELECT arrayJoin([1,2,3]) AS item; -- unnest: one row per element
SELECT arrayFilter(x -> x > 2, [1,2,3,4]); -- filter: [3,4]
SELECT arrayMap(x -> x * 2, [1,2,3]); -- map: [2,4,6]
SELECT arraySort(x -> -x, [3,1,2]); -- sort descending: [3,2,1]
SELECT arrayReduce('sum', [1,2,3]); -- reduce: 6
SELECT has(['a','b','c'], 'b'); -- has: 1
SELECT arrayDifference([1,2,4]); -- diff: [0,1,2]
SELECT arrayCumSum([1,2,3]); -- cumsum: [1,3,6]
Window functions (ClickHouse 22.3+)
SELECT
user_id,
created_at,
revenue,
sum(revenue) OVER (
PARTITION BY user_id
ORDER BY created_at
ROWS BETWEEN 3 PRECEDING AND CURRENT ROW
) AS rolling_3_revenue,
revenue / sum(revenue) OVER (
PARTITION BY toYYYYMM(created_at)
) * 100 AS pct_of_month
FROM events;
Advanced SELECT
-- FINAL for ReplacingMergeTree (get latest row per key)
SELECT * FROM events FINAL WHERE event_type = 'login';
-- PREWHERE reads the filter column first, then remaining columns only for matching rows
SELECT user_id, event_type FROM events PREWHERE event_type = 'purchase';
-- SAMPLE for representative subset (requires SAMPLE BY in table)
SELECT count() FROM events SAMPLE 0.01; -- 1% sample
-- ASOF JOIN for time-series alignment (matches the closest earlier row)
SELECT a.timestamp, a.price, b.bid
FROM trades AS a
ASOF LEFT JOIN bids AS b
    ON a.stock = b.stock AND a.timestamp >= b.timestamp;
Aggregations
Standard aggregations
SELECT
count() AS total_rows,
countIf(event_type = 'purchase') AS purchases,
uniqExact(user_id) AS unique_users,
sum(revenue) AS total_revenue,
avg(revenue) AS avg_revenue,
min(created_at) AS first_event,
max(created_at) AS last_event,
stddevSamp(revenue) AS std_dev_revenue
FROM events;
-State / -Merge pattern (for AggregatingMergeTree)
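The two steps below assume a target table shaped like this sketch, where the AggregateFunction column stores intermediate sum states:
CREATE TABLE events_agg (
    event_type String,
    day DateTime,
    revenue_agg AggregateFunction(sum, Decimal(18, 2))
) ENGINE = AggregatingMergeTree
ORDER BY (event_type, day);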
-- Step 1: Insert into aggregating table
INSERT INTO events_agg
SELECT
event_type,
toStartOfDay(created_at) AS day,
sumState(revenue) AS revenue_agg
FROM events
GROUP BY event_type, day;
-- Step 2: Read aggregated result
SELECT event_type, day, sumMerge(revenue_agg) AS total_revenue
FROM events_agg
GROUP BY event_type, day;
Parametric aggregations
-- topK(N): top N most frequent values
SELECT topK(10)(user_id) FROM events;
-- histogram(N): compute N-bin histogram
SELECT histogram(10)(revenue) FROM events;
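quantile follows the same parametric shape: the level goes in the first parentheses, the column in the second:
-- quantile(level): approximate quantile at the given level
SELECT quantile(0.95)(revenue) AS p95_revenue FROM events;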
-- retention: event funnel (one call, one condition per step; returns an array of 0/1 flags)
SELECT
    user_id,
    retention(
        toDate(created_at) = '2026-01-01',
        toDate(created_at) = '2026-01-02'
    ) AS r -- r[1] = day-1 flag; r[2] = day-2 flag, counted only if day 1 matched
FROM events
WHERE event_type = 'signup'
GROUP BY user_id;
Insert, update, delete
Insert
INSERT INTO events (user_id, event_type, revenue) VALUES
('uuid-1', 'purchase', 49.99),
('uuid-2', 'refund', -49.99);
INSERT INTO events
SELECT * FROM remote_events WHERE created_at > '2026-01-01';
Mutation (UPDATE/DELETE — expensive, run async)
-- Delete: creates a mutation (async; rewrites the affected data parts)
ALTER TABLE events DELETE WHERE created_at < '2025-01-01';
-- Update: also async
ALTER TABLE events UPDATE revenue = revenue * 0.9
WHERE event_type = 'refund' AND created_at < '2026-01-01';
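Newer ClickHouse releases also support lightweight deletes with standard SQL syntax; rows vanish from query results immediately and are physically removed during merges. A sketch:
-- Lightweight delete: cheaper than ALTER TABLE ... DELETE for targeted removals
DELETE FROM events WHERE created_at < '2025-01-01';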
-- Check mutation status
SELECT * FROM system.mutations WHERE table = 'events';
Materialized views
Real-time aggregation
-- Base table
CREATE TABLE events_raw (
event_type String,
user_id String,
revenue Decimal(18,2),
created_at DateTime
) ENGINE = MergeTree() ORDER BY (event_type, created_at);
-- Materialized view for daily metrics
CREATE MATERIALIZED VIEW events_daily
ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(day)
ORDER BY (event_type, day)
AS SELECT
event_type,
toStartOfDay(created_at) AS day,
sum(revenue) AS daily_revenue,
count() AS daily_count,
uniqExact(user_id) AS daily_users
FROM events_raw
GROUP BY event_type, toStartOfDay(created_at);
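Because SummingMergeTree collapses rows asynchronously during merges, re-aggregate when reading from the view; a sketch:
SELECT event_type, day, sum(daily_revenue) AS revenue, sum(daily_count) AS events
FROM events_daily
GROUP BY event_type, day;
-- caveat: summing daily_users would overcount uniques across merged rows;
-- exact uniques need AggregatingMergeTree with uniqExactState / uniqExactMerge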
-- Now INSERT into base table — view auto-populates
INSERT INTO events_raw VALUES ('purchase', 'user-1', 99.99, now());
Performance tuning
EXPLAIN
EXPLAIN indexes = 1, actions = 1
SELECT user_id, count() FROM events WHERE event_type = 'purchase' GROUP BY user_id;
Settings that matter
-- Per-query settings
SET max_rows_to_read = 10000000;
SET max_bytes_to_read = 1073741824;
SET max_execution_time = 300;
SET max_memory_usage = 8589934592; -- 8GB
SET use_uncompressed_cache = 1;
SET prefer_localhost_replica = 1;
SET max_block_size = 65536;
SET max_threads = 16;
SET force_index_by_date = 0; -- allow range scans not matching date
Index strategy
ClickHouse’s primary index stores one value per index_granularity rows (default 8192):
-- Good: event_type is the first ORDER BY key → fast WHERE on event_type
WHERE event_type = 'purchase'
-- Slow: negated predicate can't use the primary index → full scan
WHERE event_type != 'purchase'
-- Good: date range is pruned by PARTITION BY toYYYYMM(created_at)
WHERE created_at BETWEEN '2026-01-01' AND '2026-03-31'
-- Slow: revenue is not in the ORDER BY key → full scan
WHERE revenue > 100 AND user_id NOT IN (...)
Distributed queries
-- Create a distributed table (sharding layer over the local `events` tables)
CREATE TABLE events_distributed AS events
ENGINE = Distributed(replicas_cluster, default, events, rand());
-- Query across all shards
SELECT event_type, count() FROM events_distributed GROUP BY event_type;
-- Use GLOBAL IN for distributed subqueries (runs the subquery once, broadcasts the result)
SELECT * FROM events_distributed WHERE user_id GLOBAL IN (
    SELECT user_id FROM users WHERE country = 'IN'
);
Summary
- ORDER BY key = primary key + sort order + index — choose for your query patterns
- INSERT is fast; UPDATE/DELETE are async mutations (expensive)
- Array functions are first-class — use arrayJoin, arrayMap, arrayFilter before falling back to JSON
- Materialized view + SummingMergeTree = real-time pre-aggregation on INSERT
- EXPLAIN indexes = 1 shows which index is used — always check before tuning
- -State/-Merge pattern for incremental aggregation across large datasets
- Sampling (SAMPLE 0.01) and PREWHERE are ClickHouse-specific tools for speed
FAQ
How does ClickHouse compare to PostgreSQL for analytics? ClickHouse is 10–100× faster on analytical queries over billions of rows because it’s column-oriented (only reads needed columns) and uses vectorized execution. PostgreSQL is better for transactional workloads, joins, and complex relationships. Most teams use both — PostgreSQL for writes, ClickHouse for reads.
What’s the difference between PRIMARY KEY and ORDER BY? ClickHouse doesn’t have a separate PRIMARY KEY — the ORDER BY column(s) define both sort order and index. This is different from PostgreSQL where PRIMARY KEY enforces uniqueness. ClickHouse’s ORDER BY can have duplicate values.
Why is my DELETE query taking a long time?
DELETE in ClickHouse is a mutation that marks rows for deletion and removes them during merges. For large batches, use TTL-based cleanup or partition drops (ALTER TABLE ... DROP PARTITION). For real-time deletes, use ReplacingMergeTree with a version column.
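A sketch of both bulk-cleanup approaches (the interval and partition value are examples):
-- TTL: expire rows automatically one year after created_at
ALTER TABLE events MODIFY TTL created_at + INTERVAL 1 YEAR;
-- Partition drop: remove a whole month instantly (cheap metadata operation)
ALTER TABLE events DROP PARTITION 202501;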
Can ClickHouse handle updates? Yes, via ALTER TABLE UPDATE but it’s asynchronous and expensive. Design schemas to avoid needing frequent updates — append-only works best. For real-time mutable state, use a separate Redis or in-memory table for hot data and flush to ClickHouse periodically.
What’s the max data size for a single ClickHouse node? Practically unlimited — ClickHouse handles hundreds of TB per node. Sharding across multiple nodes (Distributed engine) scales linearly. For most use cases, a single well-configured node handles tens of billions of rows comfortably.
What to read next
- PostgreSQL Cheat Sheet — for transactional data that feeds ClickHouse
- Docker Cheat Sheet — containerize ClickHouse for local dev and CI
- GitHub Actions Cheat Sheet — ETL pipelines into ClickHouse from CI