
In this series of articles we are going to demonstrate how to reliably replicate data from PostgreSQL to ClickHouse.
In part one, we will use open source PeerDB to achieve this with a self-managed stack. In part two, we will look at how we can use ClickPipes, the data integration component of ClickHouse Cloud, to achieve the same result with a fully managed solution. In part three, we include a video walkthrough of both tools for those who prefer a visual demo.
Why PostgreSQL and ClickHouse?
ClickHouse is becoming the default analytical database of choice for Postgres users. This is because both databases share a common open source heritage whilst excelling at different workloads - Postgres for transactional workloads (OLTP) and ClickHouse for analytics (OLAP).
The first use case we come across is a traditional requirement for data warehousing, where businesses want to extract and centralise application data into an OLAP warehouse for analytics, dashboards or reporting. PostgreSQL is already extremely popular as a transactional database, and with the growth of ClickHouse as a real time data warehouse, this combination of databases is naturally growing in adoption.
The second, more interesting use case arises when businesses choose to run both PostgreSQL and ClickHouse to support the same application. Often, these teams start with PostgreSQL behind their application, but over time, the analytical components of their application begin to slow down or put too much strain on the system, negatively affecting the user experience. To resolve this, they add a separate database tailored for large-scale analytics, and increasingly reach for ClickHouse.
Data Replication
Once the databases are selected, the next challenge is ensuring reliable, ongoing data replication from PostgreSQL to ClickHouse. While it’s possible to handle this at the application layer by writing data directly into ClickHouse, replicating data at the database level tends to be simpler and more reliable.
Several tools can facilitate this process, such as Debezium, Airbyte, and Fivetran. However, it’s worth considering a more recent alternative, PeerDB, for several reasons.
First, unlike many ETL tools that aim to support a wide variety of database connectors, PeerDB focuses exclusively on PostgreSQL as a data source. This specialisation has allowed the team to create one of the fastest and most scalable CDC solutions for PostgreSQL, with support for advanced features like complex data types and partitioned tables.
Second, in July 2024, ClickHouse acquired PeerDB. Since then, the integration between the two platforms has deepened, including the private preview of PeerDB integrated into ClickHouse Cloud. This strategic alignment makes PeerDB the natural choice when working with PostgreSQL and ClickHouse.
Self Managed vs Fully Managed
This leaves us with two options for using PeerDB at the time of writing.
The first is to use the PeerDB open source distribution and run it in a self-hosted configuration on your own server. This can be used to send data from any Postgres database (including open source or AWS RDS) to either open source ClickHouse or ClickHouse Cloud.
The second option is to use a fully managed version of PeerDB which is integrated and embedded into ClickHouse Cloud. In this instance, the solution is branded as PostgreSQL CDC for ClickPipes, though PeerDB is used behind the scenes.
Though open source PeerDB isn’t too tricky to set up and operate, having this critical process deployed and reliably run for you as a managed service is one less thing to worry about. We would therefore recommend that ClickHouse Cloud users take the managed route, discussed in part two.
Core Concepts
PeerDB is responsible for taking data from a PostgreSQL source database and reliably copying it into a target data warehouse, in our case ClickHouse. These sources and targets are referred to as “peers” within PeerDB.
After defining your source and target peers, you will then define “mirrors” which represent your replication processes. For each mirror, you will define which tables you want to replicate and the frequency of the replication. It may be the case, for instance, that some tables need to be fresher than others within ClickHouse, so different mirrors can be configured in different ways depending on your requirements.
When integrating PostgreSQL and ClickHouse, there will typically be an initial load or snapshot process to replicate all of your historical data. When doing this, you may need to configure options such as the batch size and the number of parallel processes to run for the snapshot. These may need to be set with care if you are loading a lot of data from a live production system.
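As a sketch of how these options look in practice (the snapshot option names below are taken from the PeerDB documentation but may differ between versions, and the peers referenced are the ones we define later in the walkthrough):
-- Sketch: tuning the initial snapshot when creating a mirror.
-- Option names may vary between PeerDB versions, so check the docs for your release.
CREATE MIRROR tuned_mirror FROM postgres_peer TO clickhouse_peer
WITH TABLE MAPPING (public.orders:orders)
WITH (
    do_initial_copy = true,
    snapshot_num_rows_per_partition = 250000, -- rows per partition read from PostgreSQL
    snapshot_max_parallel_workers = 4,        -- parallel workers per table
    snapshot_num_tables_in_parallel = 1       -- tables snapshotted concurrently
);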
After the initial load, PeerDB will then begin replicating all of the inserts, updates and deletes that occur against your chosen tables in PostgreSQL, and apply them to the ClickHouse target tables continuously and in near real-time to keep the tables in sync.
As well as simply replicating the tables, we also have the option of applying data transformations as data moves through PeerDB. Carrying out these transformations within PeerDB rather than downstream can be more efficient, whilst supporting use cases such as removing sensitive data or unnesting JSON during the replication process.
CDC vs Query Replication
There are two primary options for extracting data from your source PostgreSQL database which are referred to as CDC and Query Replication.
CDC (Change Data Capture) is a low level solution whereby PeerDB sources data by listening to the stream of change events emitted from PostgreSQL via logical replication.
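For CDC to work, the PostgreSQL source must have logical replication enabled, and the user PeerDB connects as needs replication privileges. A minimal sketch of the server-side prerequisites (assuming the postgres user from the walkthrough below):
-- CDC relies on logical decoding, which requires wal_level = 'logical'
ALTER SYSTEM SET wal_level = logical;
-- Restart PostgreSQL for this to take effect, then verify:
SHOW wal_level;
-- The user PeerDB connects as needs replication privileges:
ALTER USER postgres REPLICATION;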
The Query Replication route extracts data from the PostgreSQL source by periodically issuing a SQL query. This route allows you to do things like apply WHERE clauses to the query and carry out joins across multiple source tables. This route may also be necessary if your source table does not have a primary key.
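As a rough sketch of what a Query Replication mirror looks like in the SQL interface (the {{.start}}/{{.end}} placeholders and option names follow the PeerDB documentation, but may vary by version, so check the docs for your release):
-- Sketch: a query replication mirror paging through new rows via a watermark column.
CREATE MIRROR qrep_orders FROM postgres_peer TO clickhouse_peer FOR
$$ SELECT * FROM public.orders WHERE updated_at BETWEEN {{.start}} AND {{.end}} $$
WITH (
    destination_table_name = 'orders_qrep',
    watermark_table_name = 'public.orders',
    watermark_column = 'updated_at',
    mode = 'append'
);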
The Query Replication route is much more flexible, but there are a few downsides. Firstly, it needs to run as a periodic batch, whereas the CDC mechanism is streaming based and can run at lower latency. In addition, the Query Replication route adds more load onto the source database than the CDC approach. With a fast analytical database like ClickHouse and the option of applying transformations within PeerDB, the go-to should be to stick with CDC to emit “low level” events, and do any transformation work within PeerDB or within ClickHouse, away from your transactional application database.
Walkthrough
We will now walk through the process of setting up and configuring PeerDB for the first time in order to illustrate the process. We will assume that you have access to a running PostgreSQL and ClickHouse database as a prerequisite. You will also need Docker installed on your system.
Deploying PeerDB
You can download and run PeerDB by cloning this repo and running the run-peerdb.sh script contained within:
git clone --recursive https://github.com/PeerDB-io/peerdb.git
cd peerdb
./run-peerdb.sh
PeerDB is deployed as a set of Docker containers which are orchestrated with a Docker Compose file contained in the repo above. After a few minutes, you should see a number of Docker containers running on your system:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
0ae1675e1119 ghcr.io/peerdb-io/peerdb-ui:stable-v0.22.3 "/app/entrypoint.sh …" About an hour ago Up 47 minutes 0.0.0.0:3000->3000/tcp, :::3000->3000/tcp peerdb-ui
49a546b7bf33 ghcr.io/peerdb-io/flow-snapshot-worker:stable-v0.22.3 "./peer-flow snapsho…" About an hour ago Up 47 minutes flow-snapshot-worker
1c4ecc241450 ghcr.io/peerdb-io/flow-worker:stable-v0.22.3 "./peer-flow worker" About an hour ago Up 47 minutes flow-worker
c8e5423090ab ghcr.io/peerdb-io/flow-api:stable-v0.22.3 "./peer-flow api --p…" About an hour ago Up 47 minutes 0.0.0.0:8112-8113->8112-8113/tcp, :::8112-8113->8112-8113/tcp flow_api
74d3b70f0251 temporalio/admin-tools:1.25.2-tctl-1.18.1-cli-1.1.1 "/etc/temporal/entry…" About an hour ago Up 47 minutes (healthy) temporal-admin-tools
b8f81dac7f8c temporalio/ui:2.34.0 "./start-ui-server.sh" About an hour ago Up 47 minutes 0.0.0.0:8085->8080/tcp, [::]:8085->8080/tcp temporal-ui
03f04411a5c8 temporalio/auto-setup:1.26 "/etc/temporal/entry…" About an hour ago Up 47 minutes 6933-6935/tcp, 6939/tcp, 7234-7235/tcp, 7239/tcp, 0.0.0.0:7233->7233/tcp, :::7233->7233/tcp temporal
5fc3eb9f377b ghcr.io/peerdb-io/peerdb-server:stable-v0.22.3 "./peerdb-server" About an hour ago Up 47 minutes 0.0.0.0:9900->9900/tcp, :::9900->9900/tcp peerdb-server
380d591d9e30 postgres:17-alpine "docker-entrypoint.s…" About an hour ago Up 47 minutes (healthy) 0.0.0.0:9901->5432/tcp, [::]:9901->5432/tcp catalog
d7957db52ee6 minio/minio:RELEASE.2024-11-07T00-52-20Z "/bin/sh -c ' export…" About an hour ago Up 47 minutes 0.0.0.0:9001->9000/tcp, [::]:9001->9000/tcp, 0.0.0.0:9002->36987/tcp, [::]:9002->36987/tcp peerdb-quickstart-minio-1
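If any of these containers fail to start, or replication later stalls, the worker logs are the first place to look. The flow-worker container performs the ongoing replication, the flow-snapshot-worker handles initial loads, and the Temporal UI (port 8085 in the listing above) shows the state of the underlying workflows:
docker logs -f flow-worker
docker logs -f flow-snapshot-worker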
At this stage, PeerDB is ready to be configured and used, but first, we will need some test data to work with!
Creating Test Data In PostgreSQL
We will begin by creating a table in our PostgreSQL database. Our test data will be a set of ecommerce orders including amounts, order statuses and delivery information:
CREATE TABLE orders (
order_id SERIAL PRIMARY KEY,
customer_id INT NOT NULL,
order_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
status VARCHAR(50) DEFAULT 'Pending',
total_amount NUMERIC(10, 2) NOT NULL,
shipping_address TEXT NOT NULL,
payment_method VARCHAR(50) NOT NULL,
shipping_date TIMESTAMP,
delivery_date TIMESTAMP,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
Generating Test Data
We will then generate some random synthetic data in the source table:
INSERT INTO orders (customer_id, order_date, status, total_amount, shipping_address, payment_method, shipping_date, delivery_date)
SELECT
(RANDOM() * 1000)::INT + 1 AS customer_id,
NOW() - INTERVAL '1 day' * (RANDOM() * 365) AS order_date,
CASE WHEN RANDOM() < 0.2 THEN 'Shipped' WHEN RANDOM() < 0.6 THEN 'Completed' ELSE 'Pending' END AS status,
ROUND((RANDOM() * 1000)::numeric, 2) AS total_amount,
'123 Example St, City, Country' AS shipping_address,
CASE WHEN RANDOM() < 0.5 THEN 'Credit Card' ELSE 'PayPal' END AS payment_method,
NOW() - INTERVAL '1 day' * (RANDOM() * 30) AS shipping_date,
NOW() - INTERVAL '1 day' * (RANDOM() * 60) AS delivery_date
FROM generate_series(1, 1000000);
We have generated a million rows representing our ecommerce orders, but the same technique can scale to greater volumes for your own benchmarking.
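A quick sanity check on the source before configuring replication confirms the row count:
SELECT COUNT(*) FROM orders;
-- 1000000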
Configuring PeerDB
PeerDB can be configured in one of two ways. Firstly, there is a web based UI which can be used to configure your peers and mirrors, then monitor the ongoing state of your replication. This UI runs on port 3000 by default.
Alternatively, it is possible to connect to PeerDB and configure it via a SQL interface. This allows you to script your setup in plain SQL which can be source controlled and tested more effectively rather than needing to click around in a UI.
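The SQL interface is exposed by the peerdb-server container on port 9900. Assuming the default quickstart credentials (password peerdb, per the PeerDB docs; adjust if you have changed them), you can connect to it with psql:
psql "host=localhost port=9900 password=peerdb"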
In the example below, we create two peers, then a mirror which replicates the “orders” table in PostgreSQL to the “orders” table in ClickHouse.
-- Create the source Peer
CREATE PEER postgres_peer FROM POSTGRES WITH (
    host = 'YOUR_POSTGRES_HOST',
    port = 'YOUR_POSTGRES_PORT',
    user = 'postgres',
    password = 'postgres',
    database = 'postgres'
);
-- Create the destination Peer over the native SSL port/TLS
CREATE PEER clickhouse_peer FROM CLICKHOUSE WITH (
    host = 'YOUR_CLICKHOUSE_HOST',
    port = '9440',
    user = 'default',
    password = 'YOUR_CLICKHOUSE_PASSWORD',
    database = 'default'
);
-- Create the mirror
CREATE MIRROR cdc_mirror FROM postgres_peer TO clickhouse_peer
WITH TABLE MAPPING (public.orders:orders)
WITH (do_initial_copy = true);
Our preference is for a scriptable solution like this, so we would tend towards the SQL interface, but the end result is similar either way.
Initial Replication
In less than 20 seconds (surprisingly little time for a million rows), the initial snapshot process will complete. The table will be created in ClickHouse and the data should be replicated into it with the schema matching the PostgreSQL schema.
Audit columns for the sync time (_peerdb_synced_at), the deleted flag (_peerdb_is_deleted) and the PeerDB version (_peerdb_version) are added automatically by PeerDB:
SELECT
order_id,
customer_id,
order_date,
_peerdb_synced_at,
_peerdb_is_deleted
FROM orders
LIMIT 5
Query id: 70e1364e-66b5-431c-bbbe-e592c4b00556
┌─order_id─┬─customer_id─┬─────────────────order_date─┬─────────────_peerdb_synced_at─┬─_peerdb_is_deleted─┐
1. │ 1 │ 881 │ 2024-03-29 05:17:31.091477 │ 2025-01-24 13:50:44.250000000 │ 0 │
2. │ 2 │ 874 │ 2024-07-29 14:22:16.122020 │ 2025-01-24 13:50:44.250000000 │ 0 │
3. │ 3 │ 541 │ 2024-12-31 06:14:07.539825 │ 2025-01-24 13:50:44.250000000 │ 0 │
4. │ 4 │ 455 │ 2024-12-08 12:56:08.528432 │ 2025-01-24 13:50:44.250000000 │ 0 │
5. │ 5 │ 12 │ 2024-06-06 10:38:42.814761 │ 2025-01-24 13:50:44.250000000 │ 0 │
└──────────┴─────────────┴────────────────────────────┴───────────────────────────────┴────────────────────┘
5 rows in set. Elapsed: 0.003 sec. Processed 8.19 thousand rows, 196.62 KB (2.63 million rows/s., 63.13 MB/s.)
Peak memory usage: 5.12 MiB.
When new data is inserted into the PostgreSQL table, new rows should be propagated into ClickHouse. To demonstrate, we can create another 5000 rows in PostgreSQL:
INSERT INTO orders (customer_id, order_date, status, total_amount, shipping_address, payment_method, shipping_date, delivery_date)
SELECT
(RANDOM() * 1000)::INT + 1 AS customer_id,
NOW() - INTERVAL '1 day' * (RANDOM() * 365) AS order_date,
CASE WHEN RANDOM() < 0.2 THEN 'Shipped' WHEN RANDOM() < 0.6 THEN 'Completed' ELSE 'Pending' END AS status,
ROUND((RANDOM() * 1000)::numeric, 2) AS total_amount,
'123 Example St, City, Country' AS shipping_address,
CASE WHEN RANDOM() < 0.5 THEN 'Credit Card' ELSE 'PayPal' END AS payment_method,
NOW() - INTERVAL '1 day' * (RANDOM() * 30) AS shipping_date,
NOW() - INTERVAL '1 day' * (RANDOM() * 60) AS delivery_date
FROM generate_series(1, 5000);
Within seconds, the same batch of rows should be replicated into ClickHouse.
At this point, it may be worth running a few aggregation queries (sums, averages) to build confidence that the replication is running as expected.
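For example, the following aggregates can be run against both PostgreSQL and ClickHouse and the results compared. On the ClickHouse side, add FINAL once updates and deletes start flowing so that merged rows are counted correctly:
SELECT
    count(*) AS order_count,
    sum(total_amount) AS total_revenue,
    round(avg(total_amount), 2) AS avg_order_value
FROM orders;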
Update and Delete Semantics
By default, PeerDB will create a ReplacingMergeTree table in ClickHouse to hold the incoming data. This means that when a row is replicated into ClickHouse, any existing row with the same primary key will be replaced on the next merge. This gives you the effect of the data being “updated” as it is updated in PostgreSQL.
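You can inspect the exact table definition with SHOW CREATE TABLE orders in ClickHouse. As an illustrative sketch only (the exact type mappings and ordering key are determined by PeerDB and may differ by version), the generated table looks roughly like this, with _peerdb_version acting as the version column of the ReplacingMergeTree:
-- Illustrative sketch of the table PeerDB generates; inspect yours with SHOW CREATE TABLE.
CREATE TABLE orders
(
    order_id Int32,
    customer_id Int32,
    order_date DateTime64(6),
    status String,
    total_amount Decimal(10, 2),
    shipping_address String,
    payment_method String,
    shipping_date DateTime64(6),
    delivery_date DateTime64(6),
    created_at DateTime64(6),
    updated_at DateTime64(6),
    _peerdb_synced_at DateTime64(9),
    _peerdb_is_deleted Int8,
    _peerdb_version Int64
)
ENGINE = ReplacingMergeTree(_peerdb_version)
ORDER BY order_id;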
To illustrate this, let’s update a row in PostgreSQL:
update orders set status = 'Cancelled' where order_id = 3;
After a few seconds, and until the next merge takes place, the ClickHouse table will contain both versions of the row:
SELECT
order_id,
status
FROM orders
WHERE order_id = 3
Query id: 9a77bf2e-31fb-4cf5-a6cb-13968cb333af
┌─order_id─┬─status────┐
1. │ 3 │ Cancelled │
└──────────┴───────────┘
┌─order_id─┬─status────┐
2. │ 3 │ Completed │
└──────────┴───────────┘
2 rows in set. Elapsed: 0.006 sec. Processed 8.19 thousand rows, 171.83 KB (1.39 million rows/s., 29.17 MB/s.)
Peak memory usage: 2.07 MiB.
When we query with FINAL to illustrate how the data will look post merge, we can see that we only have one row, with the row effectively updated:
SELECT
order_id,
status
FROM orders
FINAL
WHERE order_id = 3
Query id: 2594761c-73a7-4981-9661-48a7ad5df14a
┌─order_id─┬─status────┐
1. │ 3 │ Cancelled │
└──────────┴───────────┘
1 row in set. Elapsed: 0.006 sec. Processed 16.39 thousand rows, 343.69 KB (2.82 million rows/s., 59.08 MB/s.)
Peak memory usage: 6.42 MiB.
If you do need different behaviour, you can choose to switch to a standard MergeTree engine at the point where you create your PeerDB mirror, which retains the full history of changes. As always with ClickHouse, you need to be aware of the semantics associated with the table engine you choose.
Deletes in PostgreSQL are sent into the target with the _peerdb_is_deleted column set to 1, representing a logical delete. You would therefore need to handle the fact that the record is only logically deleted, either in your downstream ClickHouse views, reports or applications.
For example, a PostgreSQL delete:
delete from orders where order_id = 3;
is eventually reflected in ClickHouse with _peerdb_is_deleted = 1:
SELECT
order_id,
status,
_peerdb_is_deleted
FROM orders
FINAL
WHERE order_id = 3
Query id: 90da35af-1f6f-470c-b932-1455aba26655
┌─order_id─┬─status─┬─_peerdb_is_deleted─┐
1. │ 3 │ │ 1 │
└──────────┴────────┴────────────────────┘
1 row in set. Elapsed: 0.006 sec. Processed 16.39 thousand rows, 343.74 KB (2.89 million rows/s., 60.66 MB/s.)
Peak memory usage: 8.18 MiB.
Again, materialised views and different table engines can be used to coerce the data to appear how you need it within ClickHouse.
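For example, a simple view (the name orders_live is our own) that collapses versions with FINAL and hides logically deleted rows gives downstream consumers a clean picture:
CREATE VIEW orders_live AS
SELECT *
FROM orders
FINAL
WHERE _peerdb_is_deleted = 0;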
Schema Evolution
If you add a column to your source table, this is automatically reflected in ClickHouse at the point that the next record is integrated.
In the example below, we have added a new column premium_customer into PostgreSQL:
alter table orders add premium_customer boolean;
The column is then reflected in ClickHouse when the next record is replicated, with a default value applied:
SELECT premium_customer
FROM orders
LIMIT 5
Query id: 62a26e8d-8892-4abb-bfc6-889212d89197
┌─premium_customer─┐
1. │ false │
2. │ false │
3. │ false │
4. │ false │
5. │ false │
└──────────────────┘
5 rows in set. Elapsed: 0.003 sec. Processed 8.19 thousand rows, 40.96 KB (3.24 million rows/s., 16.18 MB/s.)
Peak memory usage: 1.04 MiB.
Dropping a column in your source table does not remove it from ClickHouse, but subsequently replicated rows should have the value set to NULL in the target table.
Please note that even though the new column is integrated, ClickHouse schema evolution is not explicitly documented as being supported, so tread carefully when updating your source tables!
Gotchas When Running An Open Source Stack
There are a few gotchas when running the open source stack which slowed us down when getting started.
The first is related to the MinIO configuration. PeerDB takes data from PostgreSQL and puts it in a staging area before it is loaded into ClickHouse. By default this staging area is hosted in a MinIO container which runs inside the PeerDB Docker Compose stack. As ClickHouse will be running outside of Docker, it needs a hostname which resolves to the MinIO container.
Within your docker-compose.yml file, you will need to change this line:
PEERDB_CLICKHOUSE_AWS_CREDENTIALS_AWS_ENDPOINT_URL_S3: http://host.docker.internal:9001
To a real and accessible IP address where MinIO is running:
PEERDB_CLICKHOUSE_AWS_CREDENTIALS_AWS_ENDPOINT_URL_S3: http://172.31.26.57:9001
If running on AWS, you may also need to open a security group port.
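To confirm that the ClickHouse host can actually reach MinIO after making this change, MinIO's liveness endpoint is a quick test (using the example IP address from above):
curl http://172.31.26.57:9001/minio/health/live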
Secondly, we experienced a number of issues relating to Docker Snap on Ubuntu. These included containers in Docker Compose not being able to connect to each other, and not being able to write to our external EBS volume. Ensure that you have set up Docker following the instructions on their website for an easier time with PeerDB open source.
Aside from these gotchas, the open source PeerDB version was relatively easy to deploy and run, and we found a single instance to scale to millions of rows with low latency.
Conclusion
PostgreSQL and ClickHouse are highly complementary technologies for transactional and analytical workloads.
Though there are various options for ETL and CDC between these two databases, PeerDB is now the natural choice considering its performance and its recent acquisition by ClickHouse, Inc.
In this article we have introduced core concepts and demonstrated how PeerDB can be deployed as part of an open source and self managed stack.
In the next article, we describe how the same can be achieved with a fully managed, cloud-hosted solution.