import InformationEmbed from 'components/InformationEmbed.astro';
import Article from 'layouts/Article.astro';

ClickHouse is designed to work with very large datasets, up to and including petabyte scale. The problem is that we still have to contend with physical memory limits: if we run sufficiently complex queries, we can bump up against them. Many ClickHouse users eventually start seeing out-of-memory exceptions, where their queries are cancelled by the memory overcommit tracker.

One situation where this commonly happens is GROUP BY queries over large datasets containing high-cardinality data. In this article, I am going to explain why this happens and a little about what ClickHouse is doing internally, and then look at some ways to mitigate it if you do need to run a large GROUP BY on your cluster.

When we do a GROUP BY, we are essentially asking a question about every combination of values in the grouping columns. So if our table has 100 distinct orders and 100 potential products, grouping by both could produce up to 100 × 100 = 10,000 different combinations. In reality, not every dataset fully overlaps like this. For instance, if I group by city and country, not every city appears in every country, so the situation is much better; it depends heavily on the nature of your dataset.

If we have, say, a thousand distinct values in one column and a thousand in another, and we assume the worst case where all combinations are valid, we quickly have a million possible combinations in our GROUP BY, and this is where the memory can begin to inflate. For each of those combinations, ClickHouse has to store a key in a hash table. In my city-and-country example, one key might be (London, UK), another (Manchester, UK), then (Denver, US) and (San Francisco, US). Those keys literally contain the column values, so if I'm grouping by long strings, the keys themselves will take up a lot of space.
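Before running a heavy GROUP BY, it can be worth estimating how many keys you are about to materialise. A minimal sketch, assuming a hypothetical `trips` table with `city` and `country` columns:

```sql
-- Estimate how many distinct (city, country) keys the GROUP BY would
-- create, using the approximate uniq() aggregate rather than paying
-- for the full aggregation. Table and column names are illustrative.
SELECT uniq(city, country) AS approx_key_count
FROM trips;
```

If the estimate comes back in the millions, that is a warning sign that the hash table for the real query will be large.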

Next, let's think about the actual aggregation functions that we use. If we are summing all of the values in a group, we have to store an integer per key to hold the running sum. If I'm doing a count, likewise, that's another integer. If I'm doing something more complex like an average, then I have to store both a count and a sum, because of the way GROUP BY aggregation is implemented.
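To see this concretely, an average can always be decomposed into the two pieces ClickHouse keeps per key anyway. A sketch against a hypothetical `sales` table:

```sql
-- avg(amount) is maintained internally as a running sum plus a count,
-- so these two expressions produce the same result per group:
SELECT
    product_id,
    sum(amount) / count() AS manual_avg,
    avg(amount)           AS builtin_avg
FROM sales
GROUP BY product_id;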

So, what can we do to mitigate this? The first thing is to really question whether we have to do the GROUP BY in this way at all.

Firstly, do our users really need all of those fields, or are we grouping by something far too granular? If we're grouping by 5-8 columns, maybe we can aggregate at a coarser level instead.

One technique is to do the GROUP BY using fewer columns, and then join to pick up the other columns after the aggregation.
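For example, instead of dragging wide descriptive columns through the hash table, we can aggregate on a compact key and join the descriptions back afterwards. A sketch, assuming hypothetical `orders` and `products` tables:

```sql
-- Aggregate on the narrow integer key only, so the hash table stores
-- small keys, then join the wide descriptive columns back afterwards.
SELECT
    agg.product_id,
    p.product_name,
    p.category,
    agg.total
FROM
(
    SELECT product_id, sum(amount) AS total
    FROM orders
    GROUP BY product_id
) AS agg
INNER JOIN products AS p ON p.product_id = agg.product_id;
```

The hash table now only ever holds integer keys, rather than one copy of every product name and category string.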

Secondly, can we pre-filter the data? Often you'll see a GROUP BY running over unfiltered data when only a subset of rows is actually needed. It's better to filter first and then do the GROUP BY, so fewer rows go into the aggregation. This can save a lot of memory.
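As a sketch, pushing the filter below the aggregation means fewer rows, and often fewer distinct keys, ever reach the hash table; the table and columns here are illustrative:

```sql
-- Filtering first means only recent rows are aggregated, shrinking
-- both the input volume and the number of group keys that get stored.
SELECT city, country, count() AS trip_count
FROM trips
WHERE pickup_date >= '2024-01-01'
GROUP BY city, country;
```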

Next, think about the aggregation functions that you are using. Do you really need averages, or heavyweight aggregations like exact distinct counts? Can you store sums and counts separately and calculate the average later, or can you use an approximate function instead?
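For instance, `uniqExact` must remember every distinct value per key, whereas `uniq` keeps a small, bounded-size sketch. A comparison on the same hypothetical `trips` table:

```sql
-- uniqExact stores every distinct user_id per group, so its memory
-- grows with cardinality; uniq uses a bounded approximate structure.
SELECT
    city,
    uniq(user_id)      AS approx_users,
    uniqExact(user_id) AS exact_users
FROM trips
GROUP BY city;
```

If an approximate answer is acceptable, swapping `uniqExact` for `uniq` can dramatically cut the per-key state.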

There are also a number of settings which can help us to reduce the memory usage here: