
Enterprise data and analytics platforms can be complex and expensive to operate. Businesses often find themselves running multiple databases to support different data use cases, plus a plethora of supporting tools for purposes such as ETL and analytics. Often, all of these technologies run in a cloud environment such as AWS or Azure, which brings additional engineering complexity and overhead. All of this requires a large team to manage and carries significant manpower and license costs just to keep the lights on.
ClickHouse and ClickHouse Cloud can help to strip out this complexity and cost. Its unique features and properties allow you to consolidate onto a single analytical database technology which can serve multiple roles. In turn, this allows you to consolidate down to fewer copies of the data, or even one master copy representing a single source of truth. The more places you use ClickHouse, the more its benefits compound, allowing you to simplify much of the technology that sits around your database. The result is a leaner technology stack, better data quality and increased collaboration between the data experts in your business, as all data is stored in one place.
ClickHouse gets a lot of attention for its performance, but I think that in bigger companies, this architectural simplification is one of its main benefits. With this in mind, I wanted to take the time to provide more depth on this argument.
One Database Serving Multiple Use Cases
Many businesses find themselves with multiple centralised data stores. There may be a data warehouse or two to support business intelligence workloads (e.g. Teradata and Snowflake), data lakes to store unstructured data (e.g. AWS S3), services for time series or machine-generated log data (e.g. Splunk), observability systems (e.g. Datadog and New Relic) and real-time in-memory caches to serve user facing analytics (e.g. Redis).
ClickHouse is relatively unique in the database space in that it can viably serve many of these use cases. It is becoming a strong vanilla data warehouse which can support SQL-based business intelligence workloads (increasingly so as its JOIN performance improves), but one which can also scale to the large volumes of machine-generated log data that might otherwise have gone into a system such as Splunk. It is a very natural fit for observability and monitoring data, potentially avoiding the high cost of SaaS observability tools such as Datadog for some use cases. Finally, it can also serve as an application backend for the analytical functions of your user facing applications, playing the role of in-memory caches. This is one database viably replacing five or six different siloed data stores.
Each time we remove a database, we simplify our estate. It’s one less database to run and one less set of infrastructure to pay for and maintain. It’s one less copy of the data and one less set of ETL code to keep running. It’s one less set of skills required and one less license to pay. Data quality rises as we need to replicate data to fewer places, and data analysts and data scientists can more easily join up disparate data and collaborate around the same single source of truth. Consolidating databases should be an absolutely key aim for data leaders and ClickHouse gives us this potential.
It’s SQL Based
Many of the systems above are not SQL based. We may be working with proprietary JSON APIs and document abstractions, and often these APIs are not a natural fit for analytics. Typically, developers are comfortable pulling data from these systems, but business analysts and data scientists are not, meaning we have to implement an ETL job to extract the data into a more convenient format.
The fact that ClickHouse is SQL based makes your data convenient and accessible to everyone. If, for instance, your observability data is stored in ClickHouse, your developers, data analysts and data scientists can all access the same set of data and join up across disparate datasets. A lot of ETL work is stripped from your organisation.
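To make this concrete, here is a minimal sketch of the kind of cross-dataset query this enables, using the clickhouse-connect Python client. The error_logs and customers tables, the join key and the connection details are all illustrative assumptions, not a prescribed schema.

```python
import clickhouse_connect

# Hypothetical connection details for illustration only.
client = clickhouse_connect.get_client(host='localhost', username='default', password='')

# Join machine-generated log data with business data in one SQL query --
# the same query a developer, data analyst or data scientist could run.
result = client.query("""
    SELECT c.account_tier,
           count() AS errors_last_24h
    FROM error_logs AS l
    INNER JOIN customers AS c ON l.customer_id = c.customer_id
    WHERE l.timestamp > now() - INTERVAL 1 DAY
    GROUP BY c.account_tier
    ORDER BY errors_last_24h DESC
""")

for row in result.result_rows:
    print(row)
```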
Simplify Your ETL
Data teams today spend a lot of effort on ETL - extracting data from source systems, transforming it between different formats and loading it into various systems, data warehouses and data lakes. This type of work is bread and butter for data engineers, but it comes with huge overhead. In spite of this investment, we still don’t get it right. On the ground, data teams are constantly faced with broken pipelines and data quality issues, and users frequently lose trust in the data.
The fact is that the best ETL is no ETL. When we have consolidated onto fewer databases as described above, we simply don’t need to do it. All of the toil and maintenance overheads associated with ETL can be avoided, and all of the data quality issues go away. Hopefully, we will be able to deprecate an ETL tool or two in the process.
ClickHouse can also help us reduce ETL through its table engine abstraction. We can point ClickHouse at data stored in MySQL or PostgreSQL and query it directly. Likewise, we can use ClickHouse to query local files, or files stored in cloud hosted data lakes in Parquet and Iceberg formats. Each time we do this we avoid the need to copy data.
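As a rough illustration, the sketch below queries external data in place using the mysql() and s3() table functions, issued through the clickhouse-connect Python client. The hosts, credentials, table names and bucket path are placeholders.

```python
import clickhouse_connect

client = clickhouse_connect.get_client(host='localhost', username='default', password='')

# Query a MySQL table in place via the mysql() table function --
# host, schema and credentials here are placeholders.
mysql_orders = client.query("""
    SELECT status, count() AS orders
    FROM mysql('mysql-host:3306', 'shop', 'orders', 'reader', 'secret')
    GROUP BY status
""")

# Query Parquet files sitting in an S3 bucket without loading them first.
# The bucket path is illustrative.
parquet_events = client.query("""
    SELECT event_type, count() AS events
    FROM s3('https://example-bucket.s3.amazonaws.com/events/*.parquet', 'Parquet')
    GROUP BY event_type
""")

print(mysql_orders.result_rows)
print(parquet_events.result_rows)
```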
Finally, if there is no choice but to use ETL, the ClickPipes feature incorporated into ClickHouse Cloud is excellent. It allows us to continually ingest data from S3 buckets or Kafka, or to use CDC streams to automatically pull in updates from PostgreSQL or MySQL. This can be achieved without any third party tooling and avoids the need for expensive SaaS tools such as Fivetran.
Simplify Your Data Modelling And Orchestration
A significant amount of data engineering work has historically gone into making data reporting ready by preprocessing it and breaking it down into different denormalised tables. This used to happen as part of the ETL process and was referred to as data modelling, whilst in a more modern data stack it became known as analytics engineering, popularised by dbt.
Complementing this is the workflow orchestration space. Here we use tools such as Airflow and Dagster to execute data pipelines and run transformation jobs, handling things such as schedules and dependencies, logging and retries.
These tools all have their value, but ClickHouse can minimise the scope and surface area for them. This is because the performance of ClickHouse means that we can work with relatively raw and unprocessed data and do not need to reorganise it for performance. Our users can work directly with raw data, or we can simply build a thin layer to coerce the data into the format we need to power our reports, dashboards and applications. As new raw data is ingested, it is available for consumption immediately, without the need to run batch ETL jobs or transformation pipelines.
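For example, that thin layer might be nothing more than a view over raw data. The sketch below assumes a hypothetical raw_events table and uses the clickhouse-connect Python client; the table and column names are illustrative.

```python
import clickhouse_connect

client = clickhouse_connect.get_client(host='localhost', username='default', password='')

# A thin reporting layer: a plain view over a hypothetical raw_events table,
# created once and queried directly by dashboards. No batch transformation
# job is needed -- new rows in raw_events appear in the view immediately.
client.command("""
    CREATE VIEW IF NOT EXISTS daily_active_users AS
    SELECT toDate(timestamp) AS day,
           uniqExact(user_id) AS active_users
    FROM raw_events
    GROUP BY day
""")

result = client.query("SELECT * FROM daily_active_users ORDER BY day DESC LIMIT 7")
print(result.result_rows)
```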
Simplify Your Application Architecture
Writing a real time, dynamic application filled with reports, metrics and charts can present some challenges for developers.
Firstly, most of the OLTP databases that they typically develop against are not optimised for analytics. Developers find that their applications slow down and load on their database increases as they include more metrics and analytics in their applications. At this point, they may be tempted to pre-calculate numbers and perhaps store them in a cache to improve application performance and keep load off their database. This can work and scale up to a point, but it can be tricky to get right and implies much more work for developers.
Second, from a user interface perspective, our users expect real time data and user experiences such as dashboards and applications that update without a page refresh. This means that developers are often pushed towards implementing a subscription system based on web sockets whereby updates can be pushed to the frontend as numbers change.
All of this is implemented with the aim of giving users access to real time, interactive analytics and a snappy user experience, but it’s undoubtedly more complex for developers.
Because of the performance of ClickHouse, my preference is to use ClickHouse as the primary datastore and system of record for the data that powers user facing analytics. Our web applications can then request the aggregations they need, and the results can be calculated dynamically at request time. Because the performance is so strong, there is usually no need for pre-calculation or a separate caching layer, which keeps the application architecture much simpler.
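As a sketch of what this can look like, the hypothetical handler below computes an hourly aggregation directly against ClickHouse at request time, with no cache or pre-computed table in between. The page_views table, its columns and the connection details are assumptions for illustration.

```python
import clickhouse_connect

client = clickhouse_connect.get_client(host='localhost', username='default', password='')

def page_view_series(site_id: int, hours: int = 24):
    """Return a per-hour page view series, computed at request time.

    No pre-aggregation or cache: the hypothetical page_views table is
    queried directly and the aggregation runs on every request.
    """
    result = client.query(
        """
        SELECT toStartOfHour(timestamp) AS hour,
               count() AS views
        FROM page_views
        WHERE site_id = %(site_id)s
          AND timestamp > now() - INTERVAL %(hours)s HOUR
        GROUP BY hour
        ORDER BY hour
        """,
        parameters={'site_id': site_id, 'hours': hours},
    )
    return result.result_rows

# Example usage: the web application calls this per request and renders the rows.
print(page_view_series(site_id=42, hours=24))
```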
Reducing Requirement For Stream Processing
If you need to calculate and process data in real time, you may be tempted to move into the stream processing space and make use of technologies such as Kafka Streams and Flink.
Whilst these technologies are very impressive and powerful, they can also be relatively difficult to use and reason about. Running a Flink cluster, for example, is a significant operational commitment in its own right.
Businesses that go down the stream processing route often end up with two parallel architectures (known as the Lambda architecture), one for batch and one for streaming, which is yet another set of effort to build and maintain.
With ClickHouse, we can potentially avoid going down this rabbit hole and use it to solve both batch and streaming related problems. We can stream directly into ClickHouse tables using the Kafka table engine, and have analytics updated downstream using materialised views. This can delay the day that we actually need to build a stream processing architecture.
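A minimal sketch of that pattern, with a placeholder broker, topic and event schema, issued through the clickhouse-connect Python client, might look like the following:

```python
import clickhouse_connect

client = clickhouse_connect.get_client(host='localhost', username='default', password='')

# 1. A Kafka engine table that consumes the stream.
#    Broker address, topic name and the schema below are placeholders.
client.command("""
    CREATE TABLE IF NOT EXISTS events_queue (
        timestamp DateTime,
        user_id   UInt64,
        event     String
    ) ENGINE = Kafka
      SETTINGS kafka_broker_list = 'broker:9092',
               kafka_topic_list  = 'events',
               kafka_group_name  = 'clickhouse_events',
               kafka_format      = 'JSONEachRow'
""")

# 2. A MergeTree table to hold the data long term.
client.command("""
    CREATE TABLE IF NOT EXISTS events (
        timestamp DateTime,
        user_id   UInt64,
        event     String
    ) ENGINE = MergeTree
      ORDER BY (event, timestamp)
""")

# 3. A materialised view that moves rows from the stream into storage as
#    they arrive, keeping downstream analytics continuously up to date.
client.command("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS events_mv TO events AS
    SELECT timestamp, user_id, event FROM events_queue
""")
```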
Simplify The Data Science Workflow
We previously demonstrated how to make use of ClickHouse for data science work, including forecasting, anomaly detection and linear regression, without writing a line of Python.
We have found that a lot of work in the data science lifecycle can be carried out entirely within ClickHouse. For instance, instead of a separate feature engineering pipeline and repository, we can simply build views over raw data to coerce the data into the format we need to train our models.
We can also use the analytical functions of ClickHouse to perform numerical processing inline inside the database, reducing the amount of code that we need to write outside of the database.
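As a rough example, the query below pushes a percentile, a correlation and a least-squares trend line into a single aggregation. The sensor_readings table and its columns are hypothetical.

```python
import clickhouse_connect

client = clickhouse_connect.get_client(host='localhost', username='default', password='')

# Numerical work done inside the database: quantile() computes a percentile,
# corr() a correlation coefficient, and simpleLinearRegression() returns the
# slope and intercept of a least-squares fit, all in one pass over the data.
result = client.query("""
    SELECT
        quantile(0.95)(temperature) AS p95_temperature,
        corr(temperature, humidity) AS temp_humidity_corr,
        simpleLinearRegression(toUnixTimestamp(timestamp), temperature) AS trend
    FROM sensor_readings
    WHERE timestamp > now() - INTERVAL 30 DAY
""")

print(result.result_rows[0])
```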
We have also shown how we use ClickHouse to record data from our model training runs and evaluations. This allows us to track the performance of our models and query it directly using SQL, without needing a separate model tracking tool.
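A minimal sketch of that idea, assuming a hypothetical model_runs table and illustrative metric names, might look like this:

```python
import clickhouse_connect

client = clickhouse_connect.get_client(host='localhost', username='default', password='')

# A hypothetical model_runs table used as a lightweight experiment tracker.
client.command("""
    CREATE TABLE IF NOT EXISTS model_runs (
        run_time   DateTime DEFAULT now(),
        model_name String,
        params     String,
        rmse       Float64
    ) ENGINE = MergeTree
      ORDER BY (model_name, run_time)
""")

# Record a training run...
client.insert('model_runs',
              [['demand_forecast', '{"horizon": 7}', 12.4]],
              column_names=['model_name', 'params', 'rmse'])

# ...and compare models directly with SQL.
best = client.query("""
    SELECT model_name, min(rmse) AS best_rmse
    FROM model_runs
    GROUP BY model_name
    ORDER BY best_rmse
""")
print(best.result_rows)
```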
As we wrote here, we think that a simple data science notebook provided by a tool such as HEX running against ClickHouse is an incredibly powerful and incredibly simple stack for data science and analytics work.
Reducing Operational Overhead With ClickHouse Cloud
The ClickHouse Cloud product is excellent. As it’s fully managed, it allows you to avoid a significant amount of work around scaling, backups, monitoring and security.
Though the same can be said of the likes of Snowflake, Redshift and BigQuery, these systems do not have an on-premise, self-hosted option in the way that open-source ClickHouse does. This means ClickHouse gives us the ability to move between the two if requirements change, which I think is a key differentiator.
Simplification
Hopefully we have made the point above that ClickHouse can be a powerful tool for simplifying your data stack, serving multiple roles and taking the strain off other parts of your technology estate.
I do not claim that all of the technologies above become redundant with ClickHouse. At a certain level of scale, orchestration tools, specialised databases, stream processing and observability tools may all become relevant, and perhaps there are certain parts of your business where they still add value today. However, with ClickHouse at the heart of your data strategy, their use can often be delayed or shrunk to simplify things and take out cost.