
Though ClickHouse is very fast, it is still sometimes necessary to pre-compute the results to common queries. This may simply be to speed up a particularly complex query for your users. However, reducing query latency is not the only motivation to pre-calculate. For instance, strategically materialising results could also help to serve more concurrent users and reduce load on the database, allowing you to scale your system better.
With other databases, teams will usually make use of external ETL tools and techniques to build their pre-calculated results and aggregations. This means that another tool has to be adopted and maintained. Often these ETL processes will run in a scheduled batch, meaning that your reports and dashboards will see delayed results which may only update once per hour or day.
ClickHouse on the other hand has a few native features which will allow us to pre-compute aggregates. These include incremental materialsied views, refreshable materialised views and projections, the latter of which we will focus on and demonstrate here.
The most popular and original type of views in ClickHouse were incremental materialised views. These views are automatically updated each time data is inserted into some base table. This means that the view is continually kept up to date in real time, and that we never have to executed a complex query over your entire batch of data. ClickHouse materialised views are very powerful and one of the main differentiators of ClickHouse.
More recently, ClickHouse also introduced the idea of refreshable materialised views. These are updated on a schedule and would typically rebuild over the entire dataset.
There is a slightly lesser known feature for acheiveing the same thing in ClickHouse called projections. Projections are essentially materialised views which ClickHouse manages for us. When we define a projection then ClickHouse will keep it up to date for us in the background. When we query the table, ClickHouse sees if there is a projection available which would speed up the query, then use that on our behalf. This means we get many of the benefits of the materialsied view for free.
We can define a projection on an existing table like this:
alter table uk_price_paid add projection price_agg_projection (
select
locality,
town,
district,
county,
avg(price),
min(price),
max(price)
group by
locality,
town,
district,
county
);
After defining the projection on a table with existing data, we need to explicitly materialize it to populate the results:
alter table uk_price_paid materialize projection price_agg_projection;
If we then query our original base table with a query that matches the projecftion:
SELECT
locality,
town,
district,
county,
avg(price),
min(price),
max(price)
FROM uk_price_paid
GROUP BY
locality,
town,
district,
county
LIMIT 10
Query id: 0bf25c2a-7af3-4d6e-b7e7-bf05f341fb50
┌─locality──────────┬─town──────────┬─district────────────┬─county─────────────┬─────────avg(price)─┬─min(price)─┬─max(price)─┐
1. │ WALKHAMPTON │ YELVERTON │ WEST DEVON │ DEVON │ 84589.47368421052 │ 6000 │ 285000 │
2. │ SNEYD PARK │ BRISTOL │ BRISTOL │ AVON │ 141666.66666666666 │ 120000 │ 170000 │
3. │ EDALE │ HOPE VALLEY │ HIGH PEAK │ DERBYSHIRE │ 123950 │ 57000 │ 275000 │
4. │ ST CLEARS │ CARMARTHEN │ CARMARTHEN │ DYFED │ 40166.666666666664 │ 13500 │ 64500 │
5. │ BROOKLANDS │ SALE │ MANCHESTER │ GREATER MANCHESTER │ 88054.62962962964 │ 47000 │ 132500 │
6. │ MARSHSIDE │ CANTERBURY │ CANTERBURY │ KENT │ 177277.77777777778 │ 53000 │ 395000 │
7. │ SUTTON │ ELY │ EAST CAMBRIDGESHIRE │ CAMBRIDGESHIRE │ 70685.10765550239 │ 10000 │ 283100 │
8. │ CROPREDY │ BANBURY │ CHERWELL │ OXFORDSHIRE │ 122168.01162790698 │ 20000 │ 339950 │
9. │ CROOKLANDS │ MILNTHORPE │ SOUTH LAKELAND │ CUMBRIA │ 77063.04347826086 │ 33000 │ 180000 │
10. │ CLEOBURY MORTIMER │ KIDDERMINSTER │ NORTH SHROPSHIRE │ SHROPSHIRE │ 195000 │ 195000 │ 195000 │
└───────────────────┴───────────────┴─────────────────────┴────────────────────┴────────────────────┴────────────┴────────────┘
10 rows in set. Elapsed: 0.032 sec. Processed 41.64 thousand rows, 3.11 MB (1.28 million rows/s., 95.93 MB/s.)
Peak memory usage: 15.73 MiB.
We can confirm that the projection was targetted by interrogating the query plan:
EXPLAIN
SELECT
locality,
town,
district,
county,
avg(price),
min(price),
max(price)
FROM uk_price_paid
GROUP BY
locality,
town,
district,
county
LIMIT 10
Query id: 38e05ad7-999e-452a-b932-dda89e6555fd
┌─explain──────────────────────────────────────────┐
1. │ Expression ((Project names + Projection)) │
2. │ Limit (preliminary LIMIT (without OFFSET)) │
3. │ Aggregating │
4. │ Expression │
5. │ ReadFromMergeTree (price_agg_projection) │
└──────────────────────────────────────────────────┘
One of the main uses of projections is to store data in different orders. As we know, ClickHouse performance is very tied to the order of it’s data being stored in a way which matches common query patterns. By using projections, we can store tables.
I initially didn’t like the sound of this as it feels wasteful to replicate entire tables. However, iof due to ClickHouse amazing compression, we find that a huge amount of raw data compresses down to a small amount. If we then only select a subset of the used columns in our projection then they can be quite small.
Projections do not have a where clause and they cannot include joins, but they do have a
Projections are really an underapprecaited feature of ClickHouse. I find myself having reached for them more than materialised views lately.