Several lives ago, I spent a lot of time doing batch ETL. I got pretty good at it, in fact. But we discovered that it is inherently bad. Not so bad that you should avoid it at all costs, but just bad enough that it will introduce issues over time that are really hard to isolate, never mind the burden it puts on the source system. One day I’ll write about how relying on an incrementing ID or a “last touch date” will eventually cause you to miss records. But not today. Today is about a more foolproof method that ends up being much easier to set up anyway.
Enter Change Data Capture, or CDC. You can hardly swing a dead cat without hitting a few hundred articles on CDC, so I won’t bore you with a long explanation of what it is and why it’s important; I’ll distill it down to one sentence: CDC is a real-time stream of all the stuff that happened to your database, in sequence, without any additional load placed on the source system. Inserts, updates, deletes, schema changes: it’s all in that CDC stream. That leaves just two questions: (1) how do I create & populate this stream, and (2) how do I consume it? We will answer the question of how to create & populate the CDC stream right here and now. How to consume it will be covered in Part 2.0+.
But Why Redpanda?
If you’re at all familiar with change data capture, you probably know that the usual reference architecture centers around Apache Kafka to hold the stream of change records generated by the database. It’s a tried and true design, but it also arguably has a bigger footprint than necessary, since you also need to run a Zookeeper cluster to keep track of your Kafka cluster. Eliminating Zookeeper from Kafka has been on the roadmap for some time, but it’s still a dependency as of January 2023. Apache Kafka is also JVM-based, which is great if you know Java and want to be a committer, but maybe not the optimal choice if you need to maximize performance. But despite all that, Apache Kafka is ubiquitous, and not for any bad reasons.
At the risk of minimizing what Redpanda is, for all intents and purposes it is simply Kafka. Interacting with Redpanda, you wouldn’t know it isn’t Kafka: same functionality, same API, but without the need for Zookeeper. Because it speaks the same API, your existing tooling will play just as nicely with Redpanda as it does with Apache Kafka. It is written in C++ as a single binary that natively includes things like a schema registry to simplify your deployment. And being written in C++ rather than Java, Redpanda shows significantly improved performance metrics over “legacy” Kafka. But lest I sound like a shill for Redpanda, the truth is that I wanted to spend some time with a new toy, and it is easy to get up and running, so that is “why Redpanda.”

The Setup
As per usual, we’re going to make heavy use of Docker, so you’ll want to be sure you’ve got that installed & working. Conventional wisdom would also dictate that you use Docker Compose here, but I’m going to break ranks and spin up the necessary components individually. It would be remiss of me not to mention that much of this is spelled out across various Redpanda blog posts and documentation. Credit where credit is due, but be warned: if you follow their blog posts to the letter, it won’t work (some docker commands are missing requirements, and some incorrect ports are referenced).
Setup Step 1: Docker Network
First things first, we need to create a docker network to allow our containers to talk to one another. We’ll call it rp because after all, this is all about using Redpanda. Every container we spin up will need to specify this network.
docker network create rp
Setup Step 2: Redpanda (aka Kafka)
Next we need to spin up a Redpanda “cluster.” It’s only going to be one node so it’s a little disingenuous to call it a cluster, but the point here isn’t to prove out the viability of distributed computing; the point is to demonstrate a CDC data pipeline using Redpanda. More sophisticated deployment options can be found in Redpanda’s docs.
docker run -d --rm --name=redpanda \
--net rp \
-p 9092:9092 \
docker.redpanda.com/vectorized/redpanda:latest \
redpanda start \
--set redpanda.auto_create_topics_enabled=true \
--overprovisioned \
--smp 1 \
--memory 1G \
--node-id 0 \
--reserve-memory 0M \
--kafka-addr PLAINTEXT://0.0.0.0:29092,OUTSIDE://0.0.0.0:9092 \
--advertise-kafka-addr PLAINTEXT://redpanda:29092,OUTSIDE://localhost:9092
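Before moving on, it’s worth a quick sanity check. The Redpanda image bundles rpk, its admin CLI, so you can ask the broker to describe itself (the container name below matches the command above):

```shell
# rpk ships inside the Redpanda image; this should report a
# single healthy broker listening on the addresses we advertised
docker exec redpanda rpk cluster info
```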
Setup Step 3: Redpanda Console
One of the common complaints around Kafka is something called “Kafka blindness.” Out of the box, Kafka doesn’t provide a UI to allow visibility into topics, messages, offsets, etc. There are plenty of tools you can use to restore that visibility, but Redpanda provides their Console tool to cover that gap. For more details and options, the Redpanda docs are going to be your bestie.
Readers of this blog will recall that I love Apache NiFi, which happens to run on port 8080. The Redpanda Console also runs on 8080, so we’ll remap that to 7777 just in case we want to use NiFi later on (spoiler alert?).
docker run -d --rm --name console \
--net rp \
-e KAFKA_BROKERS=redpanda:29092 \
-p 7777:8080 \
docker.redpanda.com/vectorized/console:latest
Once this container is up, point your browser to http://localhost:7777 (or whatever port you used) to verify you can get to the Redpanda console. Since we haven’t created any topics or published any messages, this may leave you with a slightly empty feeling. Welcome to post-modernism.

Setup Step 4: Postgres

We’ll obviously need a database to source some changes for our stream, so why not Postgres? In order for Postgres to allow for streaming its change log, we need to configure the Postgres write-ahead log (WAL) for logical replication. Put the below config into a new file called pgwal.conf saved in your current working directory. When we spin up our Postgres container, we will mount this file as a volume that the database will use for its configuration. Of course you could similarly modify the .conf file of an existing database and be up and running as well.
# CONNECTION
listen_addresses = '*'
# REPLICATION
wal_level = logical
max_wal_senders = 4
max_replication_slots = 4
Feel free to change your Postgres user password if you’re into all that security stuff.
docker run -d --rm --name pg \
--net rp \
-v $PWD/pgwal.conf:/usr/share/postgresql/postgresql.conf.sample \
-e POSTGRES_PASSWORD=admin \
-p 5432:5432 \
postgres:latest
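If you want to confirm the mounted config was actually picked up, ask Postgres directly (this assumes the container name pg and the default postgres user from the command above); wal_level should come back as logical:

```shell
# psql is available inside the postgres image;
# this should print "logical", confirming the WAL is ready for CDC
docker exec pg psql -U postgres -c "SHOW wal_level;"
```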
Setup Step 5: Kafka Connect/Debezium
Here is the first time we break from Redpanda components for our Kafka tools. Kafka Connect with Debezium is my favorite way to stream a database change log. Once configured, it will find your database (and optionally, specific tables) and start streaming the Postgres CDC stream into a Redpanda topic. We’re going to leverage Confluent’s Kafka Connect docker image, with one small modification to install the Debezium Postgres connector. Create a new file called Dockerfile in your current working directory.
FROM confluentinc/cp-kafka-connect-base:latest
RUN confluent-hub install --no-prompt debezium/debezium-connector-postgresql:latest
And then build a new Docker image from this Dockerfile:
docker build --no-cache . -t kconnect-debezium:1.0
Once we have that docker image built, you simply won’t be able to stop yourself from spinning up a so-imaged container. I certainly won’t stand in your way.
docker run -it -d --rm \
--name connect \
--net rp \
-p 8083:8083 \
-e CONNECT_BOOTSTRAP_SERVERS=redpanda:29092 \
-e CONNECT_REST_PORT=8083 \
-e CONNECT_GROUP_ID="1" \
-e CONNECT_CONFIG_STORAGE_REPLICATION_FACTOR=1 \
-e CONNECT_STATUS_STORAGE_REPLICATION_FACTOR=1 \
-e CONNECT_OFFSET_STORAGE_REPLICATION_FACTOR=1 \
-e CONNECT_CONFIG_STORAGE_TOPIC="cdc-config-storage" \
-e CONNECT_OFFSET_STORAGE_TOPIC="cdc-offsets-storage" \
-e CONNECT_STATUS_STORAGE_TOPIC="cdc-status-storage" \
-e CONNECT_KEY_CONVERTER="org.apache.kafka.connect.json.JsonConverter" \
-e CONNECT_VALUE_CONVERTER="org.apache.kafka.connect.json.JsonConverter" \
-e CONNECT_INTERNAL_KEY_CONVERTER="org.apache.kafka.connect.json.JsonConverter" \
-e CONNECT_INTERNAL_VALUE_CONVERTER="org.apache.kafka.connect.json.JsonConverter" \
-e CONNECT_REST_ADVERTISED_HOST_NAME="connect" \
kconnect-debezium:1.0
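Once the worker finishes starting (give it 30 seconds or so), you can verify the Debezium plugin was actually installed by hitting Kafka Connect’s REST API on the port we mapped:

```shell
# The JSON response should include
# io.debezium.connector.postgresql.PostgresConnector
curl -s http://localhost:8083/connector-plugins
```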
These environment variables are pretty much standard settings, with the exception of the three “storage topic” items. Those will be the names of the administrative topics that hold the CDC metadata.
Setup Step 6: Register a Postgres/Debezium connector
In order for Kafka Connect to do something (which is sort of the point, after all) we need to first register a Debezium connector. The configuration primarily consists of some database connectivity parameters, and some options around which databases, tables, etc. you want to stream changes for. But that’s just the surface. You can do some crazy stuff with Debezium, like conditional row filtering, column transformations, and other ETL-esque tasks. But don’t take my word for it; see what Debezium has to say about it.
This particular connector is set to point to our Postgres container, which we named “pg,” and that’s the value we use for the database.hostname parameter. Since this is just a demonstration of the possible, we’ll be using the public schema under the postgres database, and authenticating as the postgres user. The name & database.server.name can be whatever you want. Regardless, save this JSON configuration into a file in your current working directory and name it pg-connector.json.
{
"name":"cdc-connector",
"config":{
"connector.class":"io.debezium.connector.postgresql.PostgresConnector",
"database.hostname":"pg",
"plugin.name":"pgoutput",
"tasks.max": "1",
"database.port":"5432",
"database.user":"postgres",
"database.password":"admin",
"database.dbname":"postgres",
"schema.include.list":"public",
"database.server.name":"test-cdc"
}
}
To actually register the connector, we’ll pass that JSON file to our Kafka Connect container through a REST endpoint:
curl -i -X POST -H "Accept:application/json" -H "Content-Type:application/json" http://localhost:8083/connectors/ -d @pg-connector.json
which should return a confirmation with a 201 response and the JSON connector you registered:
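Beyond that 201, you can ask Connect for the connector’s state at any time; the name cdc-connector below is the one we set in pg-connector.json:

```shell
# A healthy connector reports "state": "RUNNING"
# for both the connector and its task
curl -s http://localhost:8083/connectors/cdc-connector/status
```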

Optional: Debezium UI
Issuing curls might appeal to your CICD sensibilities, but sometimes it’s nice to have a UI to work within. Good news: Debezium now provides a web-based interface for you to view & manage your connectors, which you can run as a docker container, because why not?
docker run -d --rm --name debezium-ui \
--net rp \
-p 7778:8080 \
-e KAFKA_CONNECT_URIS=http://connect:8083 \
debezium/debezium-ui:latest
Point your browser to http://localhost:7778 (since it also wants to run on 8080) and you should see your connector with some metrics and current status.

now back to our story…
At this point Kafka Connect will have created the three topics we specified when we stood up the Kafka Connect container. Kafka-bros will want to flex on their knowledge of the Kafka API spec, but the true Redpanda Chad will just refresh the Redpanda Console to get the same info with fewer keystrokes.

From here, we are ready to start streaming CDC logs off Postgres over to Redpanda. So take a sip off your Mtn Dew and get ready to do something cool.
Make CDC, not War
In a production application you probably already have hundreds of tables, but in our POC we have a blank slate. What we need to do is create some tables and insert some rows. If you have a favorite data generator, that would work too. But let’s not worry about that; you want to see it work NOW. I’m partial to DBeaver, but whatever tool you prefer, run these SQL statements against your Postgres container.
create table postgres.public.test_table
(id int
,ts timestamp default now());
create table postgres.public.test_table_the_second
(id int
,ts timestamp default now());
insert into test_table (id) values (100);
insert into test_table_the_second (id) values(200);
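If you’d rather stay on the command line than use the Console, rpk can tail a topic directly. Debezium’s Postgres connector names topics <database.server.name>.<schema>.<table>, so our first table should land in test-cdc.public.test_table:

```shell
# Consume a single record from the CDC topic and exit
docker exec redpanda rpk topic consume test-cdc.public.test_table --num 1
```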
Immediately, you should see two new topics show up in your Redpanda Console. I stress immediately, because this is a real-time solution as opposed to batch or even micro-batch, which is what we initially set out to improve upon.

These topics are named for the individual tables that we inserted data into, prefixed with “test-cdc,” which the astute reader will recall is what we set as the database.server.name when we registered our Debezium connector. Clicking on one of those topics will allow us to see the actual CDC payloads. Those payloads are rich with information about whatever happened in the database. All of it is interesting in its own right, but of particular interest to us are the payload.before & payload.after items. These show us what the row looked like prior to the CDC event, and what it looked like after. In this case it was an insert, so the before is NULL. The complete documentation on the payload contents can be found on Debezium’s site.
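For orientation, here is a heavily abbreviated sketch of what an insert event’s payload looks like; the field values are illustrative, not copied from a real capture (Debezium encodes timestamp columns as epoch-based numbers by default):

```json
{
  "payload": {
    "before": null,
    "after": { "id": 100, "ts": 1674000000000000 },
    "source": { "db": "postgres", "schema": "public", "table": "test_table" },
    "op": "c",
    "ts_ms": 1674000000123
  }
}
```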

The most important takeaway from this is that we have a complete log of database activity. It’s not compromised by multiple sessions, stale commits, cached sequence values, multiple mid-batch updates, schema changes, or any of the other overlooked use cases that plague batch ETL.
At the same time, the Redpanda Console is pretty powerful, allowing you to configure retention policies on your topics, filter messages to aid with troubleshooting, or even inject messages directly into a topic without the hassle of building a producer or knowing the API syntax. The schema registry & Kafka Connect UI are also accessible here with a little bit of configuration, which we’ll leave as a topic for another time.
Great, So What’s Next?

These topics contain a sequenced queue of all the things that happened in the database. What you do with that is where the fun really starts. Do you want to use it to drive some event-based architecture? Or perhaps you want to query the stream to identify trending behaviors in real time? Maybe push it to cloud object storage as part of your data lake? Or consume these messages and populate your Snowflake data warehouse? In part 2.0 and beyond of this series, I will explore some of these destinations. Maybe there’s something particularly cute you’d like to see done with a CDC stream? Holla back in the comments. The good news is that we’ve already done the hard part. Treat yourself to another Mtn Dew (I know I will).