
Enterprise data and analytics platforms can be complex and expensive to operate. Businesses often find themselves running multiple databases to support different data use cases, plus a plethora of supporting tools for purposes such as ETL and analytics. Often, all of these technologies run in a cloud environment such as AWS or Azure, which brings additional engineering complexity and overhead. All of this requires a large team to manage and carries significant manpower and license costs just to keep the lights on.
ClickHouse and ClickHouse Cloud can help to strip out this complexity and cost. Its unique features and properties allow you to consolidate onto a single analytical database technology which can serve multiple roles. In turn, this allows you to consolidate down to fewer copies of the data, or even one master copy representing a single source of truth. The more places you use ClickHouse, the more its benefits compound, allowing you to simplify much of the technology that sits around your database. The result is a leaner technology stack, better data quality and increased collaboration between the data experts in your business, as all data is stored in one place.
ClickHouse gets a lot of attention for its performance, but I think that in bigger companies, this architectural simplification is one of its main benefits. With this in mind, I wanted to take the time to provide more depth on this argument.
One Database Serving Multiple Use Cases
Many businesses find themselves with multiple centralised data stores. There may be a data warehouse or two to support business intelligence workloads (e.g. Teradata and Snowflake), data lakes to store unstructured data (e.g. AWS S3), services for time series or machine-generated log data (e.g. Splunk), observability systems (e.g. Datadog and New Relic) and real-time in-memory caches to serve user facing analytics (e.g. Redis).
ClickHouse is relatively unique in the database space in that it can viably serve many of these use cases. It is becoming a strong vanilla data warehouse which can support SQL-based business intelligence workloads (increasingly so as its JOIN performance improves), but one which can also scale to the large volumes of machine-generated log data that might otherwise have gone into a system such as Splunk. It is a very natural fit for observability and monitoring data, potentially avoiding the high cost of SaaS observability tools such as Datadog for some use cases. Finally, it can also serve as an application backend for the analytical functions of your user facing applications, playing the role of in-memory caches. This is one database viably replacing five or six different siloed data stores.
Each time we remove a database, we simplify our estate. It’s one less database to run and one less set of infrastructure to pay for and maintain. It’s one less copy of the data and one less set of ETL code to keep running. It’s one less set of skills required and one less license to pay. Data quality rises as we need to replicate data to fewer places, and data analysts and data scientists can more easily join up disparate data and collaborate around the same single source of truth. Consolidating databases should be an absolutely key aim for data leaders and ClickHouse gives us this potential.
It’s SQL Based
Many of the systems above are not SQL based. We may be working with proprietary JSON APIs and document abstractions, and often these APIs are not a natural fit for analytics. Typically, developers are comfortable pulling data from these systems, but business analysts and data scientists are not, meaning we have to implement an ETL job to extract the data into a more convenient format.
The fact that ClickHouse is SQL based makes your data convenient and accessible to everyone. If, for instance, your observability data is stored in ClickHouse, your developers, data analysts and data scientists can all access the same set of data and join up across disparate datasets. A lot of ETL work is stripped from your organisation.
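To make this concrete, here is a minimal sketch of the kind of cross-dataset query this enables, using the clickhouse-connect Python client. The error_logs and customers tables, the join key and the connection details are all illustrative assumptions, not a prescribed schema.

```python
import clickhouse_connect

# Hypothetical connection details for illustration only.
client = clickhouse_connect.get_client(host='localhost', username='default', password='')

# Join machine-generated log data with business data in one SQL query --
# the same query a developer, data analyst or data scientist could run.
result = client.query("""
    SELECT c.account_tier,
           count() AS errors_last_24h
    FROM error_logs AS l
    INNER JOIN customers AS c ON l.customer_id = c.customer_id
    WHERE l.timestamp > now() - INTERVAL 1 DAY
    GROUP BY c.account_tier
    ORDER BY errors_last_24h DESC
""")

for row in result.result_rows:
    print(row)
```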
Simplify Your ETL
Data teams today spend a lot of effort on ETL - extracting data from source systems, transforming it between different formats and loading it into various systems, data warehouses and data lakes. This type of work is bread and butter for data engineers, but it comes with huge overhead. In spite of this investment, we still don’t get it right. On the ground, data teams are constantly faced with broken pipelines and data quality issues, and users frequently lose trust in the data.
The fact is that the best ETL is no ETL. When we have consolidated onto fewer databases as described above, we simply don’t need to do it. All of the toil and maintenance overheads associated with ETL can be avoided, and all of the data quality issues go away. Hopefully, we will be able to deprecate an ETL tool or two in the process.
ClickHouse can also help us reduce ETL through its table engine abstraction. We can point ClickHouse at data stored in MySQL or PostgreSQL and query it directly. Likewise, we can use ClickHouse to query local files, or files stored in cloud hosted data lakes in Parquet and Iceberg formats. Each time we do this we avoid the need to copy data.
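As a rough illustration, the sketch below queries external data in place using the mysql() and s3() table functions, issued through the clickhouse-connect Python client. The hosts, credentials, table names and bucket path are placeholders.

```python
import clickhouse_connect

client = clickhouse_connect.get_client(host='localhost', username='default', password='')

# Query a MySQL table in place via the mysql() table function --
# host, schema and credentials here are placeholders.
mysql_orders = client.query("""
    SELECT status, count() AS orders
    FROM mysql('mysql-host:3306', 'shop', 'orders', 'reader', 'secret')
    GROUP BY status
""")

# Query Parquet files sitting in an S3 bucket without loading them first.
# The bucket path is illustrative.
parquet_events = client.query("""
    SELECT event_type, count() AS events
    FROM s3('https://example-bucket.s3.amazonaws.com/events/*.parquet', 'Parquet')
    GROUP BY event_type
""")

print(mysql_orders.result_rows)
print(parquet_events.result_rows)
```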
Finally, if there is no choice but to use ETL, the ClickPipes feature incorporated into ClickHouse Cloud is excellent. It allows us to continually ingest data from S3 buckets or Kafka, or to use CDC streams to automatically pull in updates from PostgreSQL or MySQL. This can be achieved without any third party tooling and avoids the need for expensive SaaS tools such as Fivetran.
Simplify Your Data Modelling And Orchestration
A significant amount of data engineering work has historically gone into making data reporting ready by preprocessing it and breaking it down into different denormalised tables. This used to happen as part of the ETL process and was referred to as data modelling, whilst in a more modern data stack it became known as analytics engineering, popularised by dbt.
Complementing this is the workflow orchestration space. Here we use tools such as Airflow and Dagster to execute data pipelines and run transformation jobs, handling things such as schedules and dependencies, logging and retries.
These tools all have their value, but ClickHouse can minimise the scope and surface area for them. This is because the performance of ClickHouse means that we can work with relatively raw and unprocessed data and do not need to reorganise it for performance. Our users can work directly with raw data, or we can simply build a thin layer to coerce the data into the format we need to power our reports, dashboards and applications. As new raw data is ingested, it is available for consumption immediately, without the need to run batch ETL jobs or transformation pipelines.
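For example, that thin layer might be nothing more than a view over raw data. The sketch below assumes a hypothetical raw_events table and uses the clickhouse-connect Python client; the table and column names are illustrative.

```python
import clickhouse_connect

client = clickhouse_connect.get_client(host='localhost', username='default', password='')

# A thin reporting layer: a plain view over a hypothetical raw_events table,
# created once and queried directly by dashboards. No batch transformation
# job is needed -- new rows in raw_events appear in the view immediately.
client.command("""
    CREATE VIEW IF NOT EXISTS daily_active_users AS
    SELECT toDate(timestamp) AS day,
           uniqExact(user_id) AS active_users
    FROM raw_events
    GROUP BY day
""")

result = client.query("SELECT * FROM daily_active_users ORDER BY day DESC LIMIT 7")
print(result.result_rows)
```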
Simplify Your Application Architecture
Writing a real time, dynamic application filled with reports, metrics and charts can present some challenges for developers.
Firstly, most of the OLTP databases that they typically develop against are not optimised for analytics. Developers find that their applications slow down and load on their database increases as they include more metrics and analytics in their applications. At this point, they may be tempted to pre-calculate numbers and perhaps store them in a cache to improve application performance and keep load off their database. This can work and scale up to a point, but it can be tricky to get right and implies much more work for developers.
Second, from a user interface perspective, our users expect real time data and user experiences such as dashboards and applications that update without a page refresh. This means that developers are often pushed towards implementing a subscription system based on web sockets whereby updates can be pushed to the frontend as numbers change.
All of this is implemented with the aim of giving users access to real time, interactive analytics and a snappy user experience, but it’s undoubtedly more complex for developers.
Because of the performance of ClickHouse, my preference is to use ClickHouse as the primary datastore and system of record for the data that powers user facing analytics. Our web applications can then request the aggregations they need, and the results can be calculated dynamically at request time. Because the performance is so strong, there is usually no need for pre-calculation or a separate caching layer, which keeps the application architecture much simpler.
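As a sketch of what this can look like, the hypothetical handler below computes an hourly aggregation directly against ClickHouse at request time, with no cache or pre-computed table in between. The page_views table, its columns and the connection details are assumptions for illustration.

```python
import clickhouse_connect

client = clickhouse_connect.get_client(host='localhost', username='default', password='')

def page_view_series(site_id: int, hours: int = 24):
    """Return a per-hour page view series, computed at request time.

    No pre-aggregation or cache: the hypothetical page_views table is
    queried directly and the aggregation runs on every request.
    """
    result = client.query(
        """
        SELECT toStartOfHour(timestamp) AS hour,
               count() AS views
        FROM page_views
        WHERE site_id = %(site_id)s
          AND timestamp > now() - INTERVAL %(hours)s HOUR
        GROUP BY hour
        ORDER BY hour
        """,
        parameters={'site_id': site_id, 'hours': hours},
    )
    return result.result_rows

# Example usage: the web application calls this per request and renders the rows.
print(page_view_series(site_id=42, hours=24))
```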
Reducing Requirement For Stream Processing
If you need to calculate and process data in real time, you may be tempted to move into the stream processing space and make use of technologies such as Kafka Streams and Flink.
Whilst these technologies are very impressive and powerful, they can also be relatively difficult to use and reason about. Running a Flink cluster, for example, is a significant operational commitment in its own right.
Businesses that go down the stream processing route often end up with two parallel architectures (known as the Lambda architecture), one for batch and one for streaming, which is yet another set of effort to build and maintain.
With ClickHouse, we can potentially avoid going down this rabbit hole and use it to solve both batch and streaming related problems. We can stream directly into ClickHouse tables using the Kafka table engine, and have analytics updated downstream using materialised views. This can delay the day that we actually need to build a stream processing architecture.
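A minimal sketch of that pattern, with a placeholder broker, topic and event schema, issued through the clickhouse-connect Python client, might look like the following:

```python
import clickhouse_connect

client = clickhouse_connect.get_client(host='localhost', username='default', password='')

# 1. A Kafka engine table that consumes the stream.
#    Broker address, topic name and the schema below are placeholders.
client.command("""
    CREATE TABLE IF NOT EXISTS events_queue (
        timestamp DateTime,
        user_id   UInt64,
        event     String
    ) ENGINE = Kafka
      SETTINGS kafka_broker_list = 'broker:9092',
               kafka_topic_list  = 'events',
               kafka_group_name  = 'clickhouse_events',
               kafka_format      = 'JSONEachRow'
""")

# 2. A MergeTree table to hold the data long term.
client.command("""
    CREATE TABLE IF NOT EXISTS events (
        timestamp DateTime,
        user_id   UInt64,
        event     String
    ) ENGINE = MergeTree
      ORDER BY (event, timestamp)
""")

# 3. A materialised view that moves rows from the stream into storage as
#    they arrive, keeping downstream analytics continuously up to date.
client.command("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS events_mv TO events AS
    SELECT timestamp, user_id, event FROM events_queue
""")
```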
Simplify The Data Science Workflow
We previously demonstrated how to make use of ClickHouse for data science work, including forecasting, anomaly detection and linear regression, without writing a line of Python.
We have found that a lot of work in the data science lifecycle can be carried out entirely within ClickHouse. For instance, instead of a separate feature engineering pipeline and repository, we can simply build views over raw data to coerce the data into the format we need to train our models.
We can also use the analytical functions of ClickHouse to perform numerical processing inline inside the database, reducing the amount of code that we need to write outside of the database.
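As a rough example, the query below pushes a percentile, a correlation and a least-squares trend line into a single aggregation. The sensor_readings table and its columns are hypothetical.

```python
import clickhouse_connect

client = clickhouse_connect.get_client(host='localhost', username='default', password='')

# Numerical work done inside the database: quantile() computes a percentile,
# corr() a correlation coefficient, and simpleLinearRegression() returns the
# slope and intercept of a least-squares fit, all in one pass over the data.
result = client.query("""
    SELECT
        quantile(0.95)(temperature) AS p95_temperature,
        corr(temperature, humidity) AS temp_humidity_corr,
        simpleLinearRegression(toUnixTimestamp(timestamp), temperature) AS trend
    FROM sensor_readings
    WHERE timestamp > now() - INTERVAL 30 DAY
""")

print(result.result_rows[0])
```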
We have also shown how we use ClickHouse to record data from our model training runs and evaluations. This allows us to track the performance of our models and query it directly using SQL, without needing a separate model tracking tool.
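A minimal sketch of that idea, assuming a hypothetical model_runs table and illustrative metric names, might look like this:

```python
import clickhouse_connect

client = clickhouse_connect.get_client(host='localhost', username='default', password='')

# A hypothetical model_runs table used as a lightweight experiment tracker.
client.command("""
    CREATE TABLE IF NOT EXISTS model_runs (
        run_time   DateTime DEFAULT now(),
        model_name String,
        params     String,
        rmse       Float64
    ) ENGINE = MergeTree
      ORDER BY (model_name, run_time)
""")

# Record a training run...
client.insert('model_runs',
              [['demand_forecast', '{"horizon": 7}', 12.4]],
              column_names=['model_name', 'params', 'rmse'])

# ...and compare models directly with SQL.
best = client.query("""
    SELECT model_name, min(rmse) AS best_rmse
    FROM model_runs
    GROUP BY model_name
    ORDER BY best_rmse
""")
print(best.result_rows)
```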
As we wrote here, we think that a simple data science notebook provided by a tool such as HEX running against ClickHouse is an incredibly powerful and incredibly simple stack for data science and analytics work.
Reducing Operational Overhead With ClickHouse Cloud
The ClickHouse Cloud product is excellent. As it’s fully managed, it allows you to avoid a significant amount of work around scaling, backups, monitoring and security.
Though the same can be said of the likes of Snowflake, Redshift and BigQuery, these systems do not have an on-premise, self-hosted option in the way that open-source ClickHouse does. This means ClickHouse gives us the ability to move between the two if requirements change, which I think is a key differentiator.
Simplification
Hopefully we have made the point above that ClickHouse can be a powerful tool for simplifying your data stack, serving multiple roles and taking the strain off other parts of your technology estate.
I do not claim that all of the technologies above become redundant with ClickHouse. At a certain level of scale, orchestration tools, specialised databases, stream processing and observability tools may all become relevant, and perhaps there are certain parts of your business where they still add value today. However, with ClickHouse at the heart of your data strategy, their use can often be delayed or shrunk to simplify things and take out cost.