Working With Iceberg Data In ClickHouse

Benjamin Wootton

Published on June 22, 2020

Over the last few years, the idea of an open table format has taken hold. This involves storing data in a format that follows an open specification. Where data might previously have been locked away in proprietary formats, open standards make it portable and interoperable. We could, for instance, manage data in one database and then switch to another with no migration.

This is a really appealing idea:

It dramatically reduces lock-in.
A single copy of your data can be stored and shared between systems. This removes the need for ETL.

A number of competing formats were released in a short space of time. These include Apache Iceberg, Apache Hudi and Delta Lake from Databricks. For a while these competed, but in recent years Iceberg seems to have gained by far the most momentum.

ClickHouse Support

This year the ClickHouse project have made a number of strides towards being able to access data stored in open formats.

In the first instance it was possible to

More recently we have been able to access directories, which allow us to get. For example you can.

ClickHouse Cloud are making some innovations such as the ability to query

Typically, Iceberg tables will be created by another database system. For instance, you could run a big transformation job in Spark or Databricks which would.

Creating An Iceberg Table

To demonstrate, I am going to create an Iceberg table which is hosted on AWS S3. We will do this totally independently of ClickHouse using a local Python script.
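As a rough illustration of what such a script could look like, here is a minimal sketch that assumes the pyiceberg and pyarrow packages are installed and uses a local SQLite-backed catalog. The table name, the columns and the sample rows are all hypothetical, and this is a stand-in rather than the exact script used for this post, so the files it writes may differ from those described next.

```python
from pathlib import Path


def sample_rows():
    """Return a few hypothetical rows; the columns are illustrative only."""
    return [
        {"id": 1, "city": "London", "amount": 12.5},
        {"id": 2, "city": "Paris", "amount": 7.0},
        {"id": 3, "city": "Berlin", "amount": 3.25},
    ]


def write_iceberg_table(warehouse_dir, table_name="default.demo"):
    """Create a local Iceberg table and append the sample rows to it."""
    # Imported lazily so this module loads without the optional packages.
    import pyarrow as pa
    from pyiceberg.catalog.sql import SqlCatalog

    warehouse = Path(warehouse_dir).resolve()
    warehouse.mkdir(parents=True, exist_ok=True)

    # A SQLite-backed catalog keeps everything on the local disk (assumption).
    catalog = SqlCatalog(
        "local",
        uri=f"sqlite:///{warehouse}/catalog.db",
        warehouse=f"file://{warehouse}",
    )
    catalog.create_namespace("default")

    data = pa.Table.from_pylist(sample_rows())
    table = catalog.create_table(table_name, schema=data.schema)
    table.append(data)  # writes Parquet data files plus Iceberg metadata
    return table
```

Calling write_iceberg_table("/tmp/warehouse") once should leave the data and metadata files under that directory, ready to be copied elsewhere.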

The script creates three files. The data is stored in x. Metadata, including the schema and the table structure, is stored in Y.

We will upload these files to an S3 bucket.
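The upload step could be sketched as follows, assuming boto3 is installed and AWS credentials are already configured. The helper names, the bucket and the key prefix are hypothetical. The planning step is kept separate from the upload so that the key layout can be checked before anything is sent.

```python
from pathlib import Path


def plan_uploads(local_dir, key_prefix):
    """Map every file under local_dir to the S3 key it should be stored at."""
    root = Path(local_dir)
    plan = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            relative = path.relative_to(root).as_posix()
            plan.append((str(path), f"{key_prefix.strip('/')}/{relative}"))
    return plan


def upload(plan, bucket):
    """Upload each planned file; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so planning works without it

    s3 = boto3.client("s3")
    for local_path, key in plan:
        s3.upload_file(local_path, bucket, key)
```

For example, upload(plan_uploads("/tmp/warehouse", "warehouse"), "my-example-bucket") would copy the whole directory tree. Iceberg metadata records file locations, so a table written to a local path and then copied may need its location to match the bucket; this sketch does not handle that.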

Accessing The Iceberg Table From ClickHouse

We will query the Iceberg table directly from ClickHouse.
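One way to do this is with the ClickHouse iceberg table function, which takes the S3 URL of the table and, optionally, credentials. The sketch below only builds the SQL text in Python so that it can be pasted into clickhouse-client or sent through any client library. The function name, URL and credentials are placeholders I have assumed for illustration.

```python
def iceberg_select(url, access_key_id, secret_access_key, limit=10):
    """Build a ClickHouse query that reads an Iceberg table on S3."""

    def quote(value):
        # Escape backslashes and single quotes for a ClickHouse string literal.
        return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"

    source = (
        f"iceberg({quote(url)}, {quote(access_key_id)}, "
        f"{quote(secret_access_key)})"
    )
    return f"SELECT * FROM {source} LIMIT {int(limit)}"
```

Printing iceberg_select("https://my-example-bucket.s3.amazonaws.com/warehouse/demo/", "KEY", "SECRET") gives a statement that can be run as-is once real values are substituted.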

Performance

There is a big tradeoff in using Iceberg data, in that it comes with a performance overhead. When data is stored in ClickHouse, it can be organised to make your queries as performant as possible. This includes on-disk layout, compression, and the use of indexes and statistics. When we access an Iceberg table, many of these are outside the control of ClickHouse, which has performance implications.

That said, performance is still good enough that the tradeoff may be worth it.

To test, we can ingest our Iceberg table into a native ClickHouse table with the following command:
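A sketch of such a statement, assuming a CREATE TABLE ... AS SELECT over the iceberg table function, is built below. The target table name, URL and credentials are placeholders, and ORDER BY tuple() is used only to keep the example short; a real table would normally get a sorting key that matches its queries.

```python
def ingest_statement(target_table, iceberg_url, access_key_id, secret_access_key):
    """Build a CREATE TABLE ... AS SELECT that copies the Iceberg data."""

    def quote(value):
        # Escape backslashes and single quotes for a ClickHouse string literal.
        return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"

    source = (
        f"iceberg({quote(iceberg_url)}, {quote(access_key_id)}, "
        f"{quote(secret_access_key)})"
    )
    # target_table is inserted as-is, so it must be a trusted identifier.
    return (
        f"CREATE TABLE {target_table} "
        "ENGINE = MergeTree ORDER BY tuple() "
        f"AS SELECT * FROM {source}"
    )
```

The column names and types of the new table are taken from the result of the SELECT, so no schema needs to be written out by hand.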

We can observe that ClickHouse has managed to compress the Iceberg table.
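To put a number on this, the on-disk footprint of the native table can be read from the system.parts table. The helper below is a small sketch that builds that query; the function name and the table and database names passed to it are assumptions.

```python
def compression_query(table, database="default"):
    """Build a query reporting compressed and uncompressed bytes for a table."""

    def quote(value):
        # Escape backslashes and single quotes for a ClickHouse string literal.
        return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"

    return (
        "SELECT sum(data_compressed_bytes) AS compressed_bytes, "
        "sum(data_uncompressed_bytes) AS uncompressed_bytes "
        "FROM system.parts "
        f"WHERE database = {quote(database)} AND table = {quote(table)} AND active"
    )
```

The compressed figure can then be compared with the total size of the Iceberg files in the bucket.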
