
In this series of articles, we demonstrate how to reliably replicate data from PostgreSQL to ClickHouse.
In part one, we use open-source PeerDB to achieve this with a self-managed stack. In part two, we look at how ClickPipes, the data integration component of ClickHouse Cloud, achieves the same result as a fully managed solution. In part three, we include a video walkthrough of both tools for those who prefer a visual demo.
Demo
In the video below, I walk through configuring both PeerDB Open Source and ClickPipes for Postgres CDC. The aim is to demonstrate the user experience of both options and illustrate the simplicity of the fully managed solution.
Video Transcript
In this video, I’m going to discuss replicating data from Postgres to ClickHouse, focusing on a tool called PeerDB. Replicating data from Postgres to ClickHouse is becoming an increasingly common requirement. This typically occurs in data warehousing scenarios where data stored in Postgres needs to be moved to ClickHouse for offline analytics. Additionally, some applications experience slowdowns with their Postgres databases. In such cases, it may make sense to move some of the data into ClickHouse and perform the more analytical tasks there, to reduce the load on Postgres. This is another common pattern we encounter.
Once you decide to run both databases in parallel, the next challenge is determining how to get the data into ClickHouse. One option is to handle this at the application layer, where the application writes to both databases or chooses where to write. However, a simpler and more reliable solution is to replicate the data at the database level, cloning and transferring data from Postgres to ClickHouse. There are various tools to achieve this, such as Airbyte in the open-source space, Fivetran in the commercial SaaS space, or even a lower-level approach with Debezium, an open-source change data capture (CDC) framework.
However, it’s worth considering a newer tool called PeerDB for a few reasons. First, PeerDB is an open-source project focused exclusively on Postgres as a data source. Rather than attempting to support all kinds of databases, PeerDB specializes in Postgres, allowing it to develop a high-performing, robust, and fully-featured CDC solution specifically for Postgres. Second, PeerDB was acquired by ClickHouse in 2024, and since then, the integration between the two platforms has deepened. PeerDB is now integrated into ClickHouse Cloud, making it a natural choice for users looking to integrate these two databases.
In this video, I’ll start by demonstrating the open-source PeerDB solution, giving you an idea of how it works behind the scenes and exploring some of the parameters and options it offers. I’ll then move on to the ClickHouse Cloud solution, where PeerDB is integrated into the ClickPipes feature, so you can see both options in action.
Let’s start with the demo. Here, I’m on an empty server with nothing running. PeerDB is distributed as a set of Docker containers, so if I run the script run_peerDB.sh, it will pull down the necessary containers and start them via Docker Compose. One of the services exposed here is the PeerDB UI, which listens on port 3000 by default. If we navigate to it in our browser, the first step is to create two peers—one for the source database (Postgres) and one for the target (ClickHouse).
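The startup step can be sketched roughly as follows. The repository URL and script name here follow PeerDB's quickstart and may differ between versions, so treat this as an approximate sketch rather than exact commands:

```shell
# Approximate startup steps (script name per the PeerDB quickstart;
# verify against the current repository before running).
git clone --recursive https://github.com/PeerDB-io/peerdb.git
cd peerdb
./run-peerdb.sh   # wraps docker compose to pull and start the containers

# Once the containers are up, the PeerDB UI is served on port 3000:
# http://localhost:3000
```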
For the source peer, I configure the Postgres database by specifying the host, port, username, password, and database name. After validating the connection, I create the peer. Then, for the target, I choose ClickHouse. This particular instance is running in ClickHouse Cloud, so I’ll enter the hostname, connect on a secure port, and provide the necessary credentials. I’ll replicate data into the default ClickHouse database for now. Again, I validate the connection and create the peer. At this point, I have two peers, but they’re not yet connected, and no data is being synchronized.
Next, I’ll create an orders table in Postgres and replicate it into ClickHouse. Let’s start by checking that the orders table doesn’t exist in either database. I’ll create the table with some basic e-commerce data, such as payment types, amounts, shipping dates, and order IDs, and I’ll also create an index. Then, I’ll generate 50,000 rows of synthetic random data to populate the table. The data will have random delivery dates, payment types, amounts, and order statuses.
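On the Postgres side, that setup could look like the sketch below. The demo's exact column names, types, and value ranges aren't shown on screen, so these are illustrative assumptions:

```sql
-- Illustrative schema; the demo's exact columns and types are assumptions.
CREATE TABLE orders (
    order_id      SERIAL PRIMARY KEY,
    payment_type  TEXT,
    amount        NUMERIC(10, 2),
    order_status  TEXT,
    delivery_date DATE
);
CREATE INDEX idx_orders_status ON orders (order_status);

-- Populate with 50,000 rows of synthetic random data via generate_series.
INSERT INTO orders (payment_type, amount, order_status, delivery_date)
SELECT
    (ARRAY['card', 'paypal', 'bank_transfer'])[1 + floor(random() * 3)::int],
    round((random() * 500)::numeric, 2),
    (ARRAY['pending', 'shipped', 'delivered'])[1 + floor(random() * 3)::int],
    current_date + (random() * 30)::int
FROM generate_series(1, 50000);
```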
I’ll validate this by running a SELECT COUNT(*) on the orders table in Postgres. We can see that there are 50,000 orders, and a SELECT query on the first five rows shows random data as expected. At this point, no data has been replicated into ClickHouse yet.
Now, I’ll set up the replication. In PeerDB, the replication process is called a mirror. I’ll create a CDC mirror, naming it CDC_orders, and select the source and target databases. I can tweak some parameters, like the frequency of syncing and the degree of parallelism for the initial load of orders. For this demo, I’ll leave the degree of parallelism set to four. I’ll select the orders table in Postgres and replicate it into the orders target table in ClickHouse. After creating the mirror, the process begins in the background.
After a few seconds, we can already see that the 50,000 rows have been replicated from Postgres to ClickHouse, which is remarkably quick. To validate, I can run SELECT COUNT(*) in ClickHouse, and we see that 50,000 rows have been replicated. I can also run an aggregation query to check the order statuses. The results match between Postgres and ClickHouse, confirming that the replication was successful.
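A query of this shape, run against both Postgres and ClickHouse, is enough to compare the two sides (`order_status` is an assumed column name, not necessarily the demo's exact schema):

```sql
-- Run the same aggregation in Postgres and in ClickHouse; the per-status
-- counts should match once the initial load has completed.
SELECT order_status, count(*) AS orders
FROM orders
GROUP BY order_status
ORDER BY orders DESC;
```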
Next, I want to load new records as they’re inserted into Postgres. I’ll generate another 10,000 rows in the Postgres orders table. After refreshing the PeerDB UI a few times, we can see that the new batch has been identified and synced. If I run a SELECT COUNT(*) query in ClickHouse, the row count increases to 60,000. I’ll add another 40,000 rows to Postgres and see that they’ve also been synced to ClickHouse. After about 26 seconds, the replication is complete.
At this point, I’ll test updates and deletes. If I update the status of an order in Postgres, I want that change to be reflected in ClickHouse. For example, I’ll change the status of one order to “Cancelled.” After refreshing the UI a few times, I can confirm that the update has been synced. If I run the same aggregation query in ClickHouse, I can see that one order is now marked as “Cancelled.” The update works as expected, though we also need to consider the table engine (in this case, ReplacingMergeTree) when querying: an updated row exists as a new version until ClickHouse deduplicates versions during background merges.
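Because the target table uses a ReplacingMergeTree engine, queries that must see exactly one version per row can use the FINAL modifier to deduplicate at read time; a sketch, assuming the same orders table:

```sql
-- ClickHouse: FINAL collapses row versions at query time, so an updated
-- order is counted once even before background merges have run.
SELECT order_status, count(*) AS orders
FROM orders FINAL
GROUP BY order_status;
```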
For deletes, PeerDB handles them as logical deletes. If I delete an order from Postgres, the deletion is captured by PeerDB. After refreshing the UI a few times, I can confirm that the order has been logically deleted in ClickHouse. I can verify this by querying for rows where the _peerdb_is_deleted metadata column is set to 1.
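That verification, and the filter you would typically add to everyday queries, look like this (`_peerdb_is_deleted` is one of the metadata columns PeerDB adds to its ClickHouse target tables):

```sql
-- Inspect the logically deleted rows.
SELECT count(*)
FROM orders FINAL
WHERE _peerdb_is_deleted = 1;

-- Exclude them from normal analytical queries.
SELECT order_status, count(*) AS orders
FROM orders FINAL
WHERE _peerdb_is_deleted = 0
GROUP BY order_status;
```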
While we’ve been using the PeerDB UI to create peers and mirrors, it’s also possible to control PeerDB through an SQL interface. I’ll demonstrate this by connecting to PeerDB via the psql command line tool. Here, I’ll paste SQL commands to create the source and target peers, followed by the mirror. After executing these commands, I’ll switch back to the PeerDB UI, where I can see that the CDC mirror and the underlying peers have been created as expected.
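The SQL commands follow roughly the shape below. The exact grammar, option names, and defaults should be checked against the PeerDB documentation, and all connection values here are placeholders:

```sql
-- Approximate PeerDB SQL syntax (verify against the PeerDB docs);
-- hosts, credentials, and database names are placeholders.
CREATE PEER postgres_source FROM POSTGRES WITH (
    host = 'postgres.example.com',
    port = 5432,
    user = 'postgres',
    password = 'secret',
    database = 'demo'
);

CREATE PEER clickhouse_target FROM CLICKHOUSE WITH (
    host = 'instance.clickhouse.cloud',
    port = 9440,
    user = 'default',
    password = 'secret',
    database = 'default'
);

-- Connect the two peers with a CDC mirror.
CREATE MIRROR cdc_orders FROM postgres_source TO clickhouse_target
WITH TABLE MAPPING (public.orders:orders);
```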
Now, let’s look at how ClickHouse Cloud handles this. In ClickHouse Cloud, you can set up a ClickPipe for Postgres CDC by specifying connection details and syncing parameters like sync interval and parallelism. After configuring the source and target, ClickHouse Cloud begins replicating the data. Within seconds, the dataset is replicated from Postgres to ClickHouse.
To demonstrate this, I’ve set up a ClickHouse Cloud account with an instance running. After creating a ClickPipe, we can see that the replication process is very quick, with all 1.5 million rows being synced from Postgres to ClickHouse in less than 20 seconds. We can run aggregation queries in ClickHouse to confirm the data looks good.
Next, I’ll update a few rows in Postgres by setting the status to “Cancelled.” After a short wait, the updates are replicated into ClickHouse. Running the aggregation query confirms the data has synced correctly, and the status change was successful.
Deletes are likewise handled as logical deletes in ClickHouse Cloud. After deleting an order in Postgres, the deletion is reflected in ClickHouse after a brief wait.
In summary, we’ve looked at how PeerDB facilitates the integration of Postgres and ClickHouse for both initial migrations and ongoing data syncs. We’ve explored the open-source PeerDB solution and the ClickHouse Cloud integration, comparing the two approaches. Whether you’re using it for offloading analytical workloads or setting up a data warehousing solution, PeerDB and ClickHouse Cloud can play a significant role in your data architecture.
Thank you for watching!