
In this series of articles, we demonstrate how to reliably replicate data from PostgreSQL to ClickHouse.
In part one, we use open source PeerDB to achieve this with a self-managed stack. In part two, we look at how we can use ClickPipes, the data integration component of ClickHouse Cloud, to achieve the same result with a fully managed solution. In part three, we include a video walkthrough of both tools for those who prefer a visual demo.
We suggest reading part one prior to this for context and key concepts.
PostgreSQL CDC For ClickPipes
ClickPipes is a component of ClickHouse Cloud that makes it easy to ingest data from sources such as S3, Kafka, and Kinesis into your ClickHouse Cloud instance.
ClickHouse recently added support for PostgreSQL CDC in ClickPipes, making it possible to easily set up and run a process to replicate data from PostgreSQL into ClickHouse Cloud, both for an initial migration and then on an ongoing basis as data is inserted, updated and deleted in the source PostgreSQL instance.
Internally, this solution is built on PeerDB open source. However, rather than using PeerDB to deploy a self-managed stack as we described in the previous article, you can use ClickPipes to quickly implement the same solution whilst avoiding the need to set up and operate the process yourself.
As well as avoiding the need to run PeerDB yourself, ClickPipes also allows you to avoid third-party ETL tools such as Airbyte and Fivetran, which are notoriously expensive and likely less performant than ClickPipes.
Private Preview access
As of January 2025, you will need to request access to the private preview via support to enable PostgreSQL CDC for ClickPipes in your Cloud account. There may be a small delay whilst the feature is enabled by support.
Why CDC?
CDC stands for “change data capture”. This involves listening to a low-level stream of inserts, updates and deletes (change events) as they take place within your source database. These events are then replicated into your target database to keep the two datasets in sync.
Though there are various ways to integrate two databases (for instance, periodically querying the source in batches via SQL), CDC tends to be one of the best. It is typically more resource efficient, because the database engine can emit the events directly without the need to repeatedly process a SQL query, and it can also offer lower latency as it is a streaming, event-based approach.
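Under the hood, PostgreSQL exposes this change stream through logical replication, which is what ClickPipes builds on. As a minimal sketch of the source-side prerequisite (the full requirements are covered in the ClickPipes documentation), you can check whether logical decoding is enabled and turn it on if needed:

-- Logical replication requires wal_level to be set to 'logical'
SHOW wal_level;

-- If it is not, enable it (requires superuser privileges and a server restart)
ALTER SYSTEM SET wal_level = 'logical';

On managed services such as Amazon RDS, this is typically controlled through a parameter group setting rather than ALTER SYSTEM.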
Walkthrough
We will now walk through the process of setting up and configuring the integration, which primarily happens through the ClickHouse Cloud UI.
Creating Test Data In PostgreSQL
As with the previous article, we will begin by creating a table in our PostgreSQL database. Our test data will be a set of ecommerce orders, including amounts, order status and delivery information:
CREATE TABLE orders (
    order_id SERIAL PRIMARY KEY,
    customer_id INT NOT NULL,
    order_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    status VARCHAR(50) DEFAULT 'Pending',
    total_amount NUMERIC(10, 2) NOT NULL,
    shipping_address TEXT NOT NULL,
    payment_method VARCHAR(50) NOT NULL,
    shipping_date TIMESTAMP,
    delivery_date TIMESTAMP,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
Generating Test Data
We will now insert some randomly generated synthetic data into the source table:
INSERT INTO orders (customer_id, order_date, status, total_amount, shipping_address, payment_method, shipping_date, delivery_date)
SELECT
    (RANDOM() * 1000)::INT + 1 AS customer_id,
    NOW() - INTERVAL '1 day' * (RANDOM() * 365) AS order_date,
    CASE WHEN RANDOM() < 0.2 THEN 'Shipped' WHEN RANDOM() < 0.6 THEN 'Completed' ELSE 'Pending' END AS status,
    ROUND((RANDOM() * 1000)::numeric, 2) AS total_amount,
    '123 Example St, City, Country' AS shipping_address,
    CASE WHEN RANDOM() < 0.5 THEN 'Credit Card' ELSE 'PayPal' END AS payment_method,
    NOW() - INTERVAL '1 day' * (RANDOM() * 30) AS shipping_date,
    NOW() - INTERVAL '1 day' * (RANDOM() * 60) AS delivery_date
FROM generate_series(1, 1000000);
With this insert query we generated a million rows to serve as our initial snapshot, but the same approach can scale to billions of rows for your own benchmarking exercises.
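Before configuring replication, it is worth running a quick sanity check against the source table to confirm the row count and the distribution of the generated data:

-- Confirm the number of rows and how they break down by order status
SELECT status, COUNT(*) AS order_count, ROUND(SUM(total_amount), 2) AS total_revenue
FROM orders
GROUP BY status
ORDER BY order_count DESC;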
Configuring the ClickPipe
With PeerDB open source you need to configure separate “peers” and “mirrors” as described in part one.
With ClickPipes, these concepts go away and you simply need to configure a new ClickPipe, specifying the connection details for your source PostgreSQL instance.
We begin by creating a new data source and creating a new ClickPipe of type PostgreSQL CDC:

In the next step we specify the connection details to our PostgreSQL database:

Advanced settings such as the PostgreSQL publication, sync interval, batch sizes and the number of concurrent processes can be configured in step 3. These are most relevant to the initial load process, where you may have a large backlog of data to replicate but need to be mindful of the load on your production database:

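For example, if you would rather create and manage the PostgreSQL publication yourself than let ClickPipes create one, a minimal sketch on the source database might look like the following (the publication name here is just an example):

-- Create a publication limited to the tables you intend to replicate
CREATE PUBLICATION clickpipes_publication FOR TABLE orders;

-- Alternatively, include every table in the database
-- CREATE PUBLICATION clickpipes_publication FOR ALL TABLES;

Note that for updates and deletes to be replicated correctly, each table generally needs a primary key (as our orders table has) or REPLICA IDENTITY FULL.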
In step 4, we configure the specific tables and columns which we want to integrate. It may be the case that we only need to replicate a subset of them as part of this particular ClickPipe:

Finally, we can select a database role if we wish to limit access for the ClickPipes user. This is good practice from a security perspective, given that we are integrating with an external system:

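On the PostgreSQL side, it is similarly good practice to connect with a dedicated, minimally privileged user rather than a superuser. As a rough sketch (the role name and password are placeholders, and the exact privileges required are described in the ClickPipes documentation):

-- A dedicated role for the ClickPipe with replication privileges
CREATE ROLE clickpipes_user WITH LOGIN REPLICATION PASSWORD 'change_me';

-- Allow it to read the schema and the tables being replicated
GRANT USAGE ON SCHEMA public TO clickpipes_user;
GRANT SELECT ON orders TO clickpipes_user;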
Data Replication Process
Once the ClickPipe has been created, it will be provisioned and enter a running state. At this point, it is worth checking in on the progress of the replication.
As with PeerDB, there is a two-step process to replicate the tables.
The first is an initial snapshot to batch load the historical data from PostgreSQL into ClickHouse. Though ClickPipes is highly performant, this could take many hours in the case of a very large and complex table, especially where we have to throttle the process to minimise load on the source database. (However long it takes, it would likely take much longer with similar tools on the market, given the proven high performance of PeerDB running behind the scenes.)
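If you are concerned about the load the initial snapshot places on the source, you can keep an eye on what is currently running in PostgreSQL while it proceeds, for example:

-- Show the queries currently running on the source, oldest first
SELECT pid, state, query_start, LEFT(query, 80) AS query
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY query_start;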
In this toy scenario, however, the million orders took less than 20 seconds, with 4 partitions loaded in parallel and each partition taking approximately 4 seconds on average:

Though nothing was optimised here, a million rows in less than 20 seconds including provisioning time demonstrates how performant PeerDB and ClickPipes can be.
After the snapshot, the ClickPipe will then begin listening for insert, update or delete change events which take place in PostgreSQL.
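For example, we can make a change on the PostgreSQL side and, after the next sync interval, check the result in ClickHouse. A small sketch, assuming the destination table mirrors the source table name and (as is typical for PeerDB) is backed by a ReplacingMergeTree engine, which is why FINAL is used to collapse older row versions:

-- In PostgreSQL: generate an update change event
UPDATE orders SET status = 'Completed', updated_at = NOW() WHERE order_id = 42;

-- In ClickHouse, after the next sync: FINAL returns only the latest version of the row
SELECT order_id, status, updated_at
FROM orders FINAL
WHERE order_id = 42;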
If we generate more rows in our PostgreSQL database, we can see that these will be replicated at the next sync interval via the Metrics tab within the ClickPipe definition:

The ClickHouse Cloud UI exposes various logs, metrics and audit information so that the ClickPipes operation can be monitored. One particularly useful screen for low-level debugging is the Source tab, which shows what is happening for each PeerDB slot, which sequence number has been completed and where replication will resume from after a restart. You can also use this screen to monitor the specific SQL which is being executed against your PostgreSQL source:

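The same information can also be cross-checked directly on the PostgreSQL side by inspecting the replication slot, for example:

-- List logical replication slots (including the one created for the ClickPipe)
SELECT slot_name, plugin, active, confirmed_flush_lsn
FROM pg_replication_slots;

-- Estimate how far the consumer lags behind the current WAL position, in bytes
SELECT slot_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS replication_lag_bytes
FROM pg_replication_slots;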
Finally, if you do want to tweak the parameters for the replication, such as batch sizes, sync intervals or the subset of tables being synced, this can be done via the Settings screen under the ClickPipe definition:

A Brief Performance Evaluation
Though we plan to do more performance evaluation of PeerDB and ClickPipes in future, I wanted to get a sense of their performance after writing these blog posts. My aim was to batch load a backlog of 250 million rows into my ClickHouse Cloud cluster. My PostgreSQL database was running on a single, fairly underpowered t2.large instance with 8 GB of RAM and 2 vCPUs. My ClickHouse Cloud environment was also pinned at 8 GB and 2 vCPUs, running on the basic plan, which means it cannot autoscale.
I adjusted the parameters for the initial snapshot load to 8 parallel threads pulling batches of 250,000 records. These values were chosen essentially at random and could be optimised for a mission-critical migration.

We found that the initial snapshot of 250 million records took 47 minutes to load:

Considering that both the PostgreSQL and the ClickHouse Cloud instances were underpowered and that we essentially picked the replication settings at random, we cannot draw many conclusions from these numbers in isolation. However, they do give us a sense of the baseline performance and a starting point to improve upon.
Conclusion
PostgreSQL and ClickHouse are highly complementary technologies for transactional and analytical workloads.
Though there are various options for ETL and CDC between these two databases, PeerDB is now the natural choice considering its performance and its recent acquisition by ClickHouse, Inc.
In this article we introduced ClickPipes for Postgres CDC, showing how easy it is to create a replication process which is fully managed for you on an ongoing basis.
We also explained how the deep integration with ClickHouse Cloud gives you the ability to run PeerDB with minimal configuration and no hosting or management, making it a compelling choice for teams that need to reliably integrate these two databases.
In the final part of the series we have included a video walkthrough of both options, making it easy to compare and contrast the two.