cd ..

Using Open Source LLMs To Analyse Data Stored In ClickHouse

Benjamin Wootton
2026-07-03
7 min read
Featured image for Using Open Source LLMs To Analyse Data Stored In ClickHouse

In this post and the accompanying video I will show how we can connect open source and self-hosted LLMs to data stored in ClickHouse.

This is a very relevant topic right now, as a lot of businesses are looking at how to bring AI back into their own data centre or their own accounts, instead of having a dependency on the big frontier model providers.

At the same time, there is huge interest in building agents and connecting them to their proprietary business data for either analytics or other agentic workflows. ClickHouse is front and center in this use case and is very well suited for it.

So here we have two big themes intersecting - agentic AI and the desire to do this on open source sovereign stacks. With this in mind, I wanted to put together a demo of how it can be achieved as it's not as complex as you would expect and can be achieved using commodity virtual machines.

Why Connect LLMs To Your Data?

Firstly, why are people looking to do this in the first place? Why connect LLMs to data, and in our case why ClickHouse in particular?

Broadly, I see three buckets of use cases.

The first is what I call agentic analytics. This is where people want to ask questions about their data in natural language. This is a replacement for dashboards and business intelligence but much more interactive and exploratory.

The second category of use cases are agentic workflows. If you imagine a back office process in finance, HR or logistics, many businesses are deploying agents to carry out those workflows as multiple step processes. Many of those agents need to call into our databases to get data, gather context and make decisions. This differs from the first use case because there's not a human prompting and asking questions interactively. It's happening in the background, fulfilling all kinds of diverse business processes around the clock.

The third bucket relates to observability. As you may know, ClickHouse is really good for logs, metrics, traces and machine generated data. A lot of people are developing agents to monitor that data, identify situations from it, and automatically respond to them. You see this in infrastructure logs, application logs, and in the security world where we're looking to detect breaches. Because ClickHouse is so good at working with this type of data, and because everyone's developing agents, there are a lot of interesting projects being built at this intersection.

Why Move Away From The Big LLM Providers?

So we're looking to connect LLMs to our ClickHouse data. But why do this with open source self-hosted models?

The first driver is about return on investment and cost reduction. As enterprise usage expands, a lot of companies are now trying to get a grip on the cost and asking tough questions about the return on investment. If you can adopt small, specialised open source models, you can take significant cost out of your AI solutions — potentially up to 80%, based on my analysis.

The second is about data sovereignty and privacy. If you're using external LLM providers, you're giving both your sensitive data and information about your business processes, workflows and your intellectual property to a third party. This is driving renewed interest in open source self-hosted models. Customising and fine tuning is also something we can do more easily with open source, so people are looking to tweak them and build differentiated experiences and IP in their models.

The next motivator is business continuity and resilience. As we know, models can become unavailable when they're overloaded, and we've also had a recent instance where models were revoked. This is an unacceptable business continuity risk, so companies are looking at bringing things more into their control to guard against it.

And finally, it's about transparency. Regulared businesses in particular need more visibility of what's going on in the model, what weights have gone into it and how it's been trained and they can't necessarily do that with a black box model provided as a service.

So two big themes are intersecting. One is: how do I connect an LLM to my business data? The second is the drive towards running AI on your own terms, in your own data centre, using open source models without dependencies on third parties.

The Stack

In the video below I show how this can be achiveed.

At the heart of the solution we have ClickHouse. As we know, this is a great data store in situations where we have billions of rows and we need to query in diverse and dynamic ways, including over very real-time, very fresh data. ClickHouse is brilliant for this use case, which is why it's showing up in all kinds of AI products and projects.

Interacting with ClickHouse, we have a model, and for the purposes of the demo in the video I'm using the open source GLM model which has had a lot of attention over the last few months. People believe this is just behind the frontier but at a fraction of the cost.

Finally, we need some way of interacting with that model. In the video I show a chat UI with an agentic style interaction. I ask questions in natural language, it will hit the model, the model will execute tools that go back into ClickHouse to pull relevant context and data, before formulating a response and passing that back to the chat UI.

How It's Deployed

So how has this been deployed? In the demo, this is just a standard virtual machine provided by Hetzner. It's a CPU based server with just 32gb of RAM, not even a GPU based server. I've found that for text-based interactions, document summaries and agentic analytics, the performance is adequate even on CPU.

I used a framework called llama.cpp for hosting the LLM. There's an alternative framework called vLLM which seems to have more mind share and is more efficient, but llama.cpp is easier to get started with and the performance is better for CPU based solutions.

A 4.9 billion parameter version of the GLM model was used. The binary is 6 gigabytes, and the aim is to fit all of that into memory along with the context window and caches. With a large context and the 6 gigabyte binary on a 32 gigabyte machine, the infrastructure seemed adequate for this use case at least to demo standard.

The Agent In Action

I have developed created my own framework for building agents over ClickHouse. This was used to build a 'trade surveillance' example.

Here is a demo of the agent in action against this open source stack:

Wrapping Up

At this point we've got a totally open source, self-hosted stack. I'm running this front-end agent on my own virtual machine, the GLM model is hosted on my Hetzner server and fronted by an OpenAI-compatible API, and I'm also running a private instance of ClickHouse. There are no dependencies on any AI vendors here whatsoever — front to back, open source, self-hosted, all in our control.

The purpose of all of this is to demonstrate how we can build open source (weights) and self hosted based AI solutions. It's fully on our own infrastructure, all of the data remains private, and all of these interactions remain private, nothing is logged, nothing is used for training, and none of our intellectual property is leaked. The solution is also delivered at a fraction of the cost.

I've now completed a number of similar explorations connecting LLMs to ClickHouse. I naturally started with OpenAI and Anthropic, but have been looking at the alternates too and found a better proposition for cost, privacy, security. For instance, I've been using the neoclouds such as Nebius and inference services such as Baseten and Fireworks AI, which are becoming my go-to as they're very fast, very resilient, with a good price/performance trade-off. However, I still think that self hosted, open source, soverign AI will become an increasingly important part of the landscape for businesses who want to use AI for more privacy sensitive use cases.

Portrait of Benjamin Wootton

Written by

Benjamin Wootton

Independent Consultant - ClickHouse

Benjamin Wootton is an independent ClickHouse consultant. I help businesses deploy ClickHouse open source and ClickHouse Cloud, build solutions on top of ClickHouse for real-time analytics, observability and AI, and resolve performance and reliability issues with their existing deployments.

Connect on LinkedIn
END OF FILE Share