TL;DR
CDC pipelines look impressive on paper. They almost always work between any two adjacent components. They fail at the seams. The morning I went to verify that a freshly-deployed Strimzi + Debezium stack was actually doing its job, the source-to-Kafka leg was perfect — and the Kafka-to-S3 leg was completely missing. That gap was invisible until I walked the data end-to-end.
This is the story of that walk, and what it taught me about deploying Strimzi + Debezium against MongoDB Atlas and Amazon RDS for PostgreSQL on a production-grade EKS cluster.

The hook: a “working” pipeline isn’t a working pipeline
We had just spun up a new region — call it <env>-eu-central-1. The previous day I had rotated the MongoDB Atlas admin password so it matched the value stored in AWS Secrets Manager (the password is generated by infrastructure-as-code, but Atlas didn’t know it had been rotated out-of-band). With auth fixed, the connector could finally authenticate. So the obvious next question: is the pipeline actually working?
I started with the easy checks. Strimzi happily reported Ready=True for every KafkaConnector resource. The Kafka Connect cluster pods were healthy. Every task was RUNNING. By every “green dashboard” metric, the platform was perfect.
Except no data was reaching S3.
Why Strimzi + Debezium, briefly
Strimzi is the Kubernetes-native operator family for everything Kafka: brokers, MirrorMaker, and — most relevant here — Kafka Connect. Debezium is the change-data-capture engine that turns a database’s transaction log into a stream of Kafka messages. Together they’re the canonical open-source answer when you want a managed CDC pipeline running on your own Kubernetes — instead of buying a SaaS connector platform.
The combination suits you when:
- You want CDC from heterogeneous sources (Postgres, MySQL, MongoDB, SQL Server) into one Kafka substrate.
- You already run Kubernetes and don’t want another vendor surface.
- You want to keep the data plane (Kafka, RDS, Atlas) on managed services but the control plane (operators, connectors) declarative and GitOps-able.
What it is not: a magic ELT. Debezium produces events. You still own the decisions about what consumes them, where they go, and how you reshape them between the source’s structure and your warehouse’s schema.
How we run it

One EKS cluster per region. Inside, we run three Strimzi KafkaConnect clusters, one per source family:
- `mongo-source-connect` — Atlas change streams via the Debezium MongoDB connector
- `postgres-source-connect` — RDS logical decoding (the `pgoutput` plugin) via the Debezium Postgres connector
- `s3-sink-connect` — the Confluent S3 sink, writing to IaC-provisioned data-lake buckets
Each KafkaConnect cluster is annotated with strimzi.io/use-connector-resources=true, which tells the operator to load connectors from KafkaConnector custom resources rather than from the REST API. Connectors become regular Kubernetes objects: rendered by Helm, reconciled by Strimzi, observable through kubectl describe.
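For orientation, a minimal sketch of what that looks like; the name, namespace, and broker address are illustrative, and a real spec carries image, resource, and config details omitted here:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnect
metadata:
  name: mongo-source-connect            # illustrative name
  namespace: kafka                      # illustrative namespace
  annotations:
    # tells the operator to manage connectors from KafkaConnector CRs,
    # not the Connect REST API
    strimzi.io/use-connector-resources: "true"
spec:
  replicas: 1
  bootstrapServers: my-kafka-bootstrap:9092   # illustrative broker address
```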
Credentials live in AWS Secrets Manager (the source of truth — generated by IaC), are mirrored into Kubernetes Secrets by External Secrets Operator, and are wired into the connector config through Kafka Connect’s secrets config provider:
```properties
config.providers: secrets,configmaps,file
config.providers.secrets.class: io.strimzi.kafka.KubernetesSecretConfigProvider
```
So inside any KafkaConnector spec you write ${secrets:<namespace>/<release>:password} and Connect resolves it at runtime. The connector worker never sees the raw secret on disk, and we never template secrets through Helm.
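One operational detail worth calling out: this provider reads Secrets through the Kubernetes API at runtime, so the Connect worker's service account needs RBAC permission to read them. A sketch, with hypothetical names (Strimzi names the worker's service account `<cluster>-connect`):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: connector-secrets-reader        # hypothetical
  namespace: kafka                      # hypothetical
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    resourceNames: ["atlas-credentials"]   # the ESO-mirrored Secret (hypothetical name)
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: connector-secrets-reader
  namespace: kafka
subjects:
  - kind: ServiceAccount
    name: mongo-source-connect-connect   # <cluster>-connect
    namespace: kafka
roleRef:
  kind: Role
  name: connector-secrets-reader
  apiGroup: rbac.authorization.k8s.io
```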
The Helm chart pattern
Each connector cluster ships as a single Helm chart in a data-lake repo, with one values.<env>.yaml per region. The values file exposes a single list:
```yaml
connectors:
  - name: organization_42_acme
    enabled: true
    snapshot_mode: initial
    database:
      include:
        list: organization_42_acme
```
The template iterates it:
```yaml
{{- range $connector := $.Values.connectors }}
{{- if $connector.enabled }}
---
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: '{{ $.Release.Name }}-{{ $connector.name | replace "_" "-" }}'
  labels:
    strimzi.io/cluster: '{{ $.Release.Name }}'
spec:
  class: io.debezium.connector.mongodb.MongoDbConnector
  config:
    mongodb.connection.string: "mongodb+srv://${secrets:...:user}:${secrets:...:password}@${secrets:...:hostname}"
    database.include.list: {{ $connector.database.include.list }}
    snapshot.mode: {{ $connector.snapshot_mode }}
    # ... a long tail of Debezium tuning lives here
{{- end -}}
{{- end -}}
```
Adding a new source is a single PR: append an entry to the connectors list. The chart renders one KafkaConnector per entry, the Strimzi operator picks it up, and the worker process registers the connector with Kafka Connect behind the scenes. The whole thing is GitOps-friendly, diffable, and reversible.
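A new-source PR is literally a values diff of this shape (entry names hypothetical):

```diff
 connectors:
   - name: organization_42_acme
     enabled: true
     snapshot_mode: initial
     database:
       include:
         list: organization_42_acme
+  - name: organization_57_initech   # hypothetical new tenant
+    enabled: true
+    snapshot_mode: initial
+    database:
+      include:
+        list: organization_57_initech
```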
Two sources, two flavours of “data”
This is where the data perspective gets interesting. The same Strimzi platform runs both pipelines, but the underlying mechanics share almost nothing:
| Aspect | MongoDB Atlas (change streams) | Postgres RDS (pgoutput) |
|---|---|---|
| Source mechanism | Replica-set oplog → change-stream API | WAL → logical replication slot |
| Data shape | JSON document (whole, or pre/post-image) | Row tuple bound to a schema |
| Filtering | database.include.list, collection.include.list | Postgres publication (FOR ALL TABLES or explicit list) |
| Schema evolution | A new field — just appears in subsequent events | DDL needs ALTER PUBLICATION if not FOR ALL TABLES, plus replica-identity decisions |
| Pre-image support | Only with changeStreamPreAndPostImages enabled per collection | Available for op:u/op:d events once REPLICA IDENTITY FULL is set |
| Slot durability | Cluster-managed, no manual cleanup | Postgres logical slot — if the connector dies and never restarts, the slot pins WAL forever |
That last row is the one that keeps Postgres CDC operators up at night. A stale replication slot can fill a 100 GB RDS volume in a long weekend.
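The check is cheap, so there is no excuse not to automate it. A query along these lines, run against the RDS primary, shows how much WAL each logical slot is pinning:

```sql
-- WAL retained by each logical replication slot; an inactive slot with a
-- growing number here is the disk-filling failure mode described above.
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
WHERE slot_type = 'logical';
```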
End-to-end validation, hop by hop
For each pipeline I ran the same five-step walk (a command-level sketch follows the list):
- Resource ready? — `kubectl get kafkaconnector -o jsonpath='{...Ready.status}'`
- Connector running on the worker? — query the Connect REST API: `/connectors/<name>/status`
- Task running? — same endpoint, check `tasks[].state`
- Topic has messages? — `kafka-topics --list` plus `GetOffsetShell` to confirm a non-zero high-water mark
- Message has the right shape? — `kafka-console-consumer` with a fresh consumer-group ID (overriding the Connect default), then parse the JSON
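In shell terms the walk looks roughly like this; connector, topic, and broker names are placeholders, and exact tool flags vary by Kafka version:

```bash
# 1+2+3: CR condition, then connector/task state from the Connect REST API
kubectl get kafkaconnector my-connector -n kafka \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
curl -s localhost:8083/connectors/my-connector/status \
  | jq '{connector: .connector.state, tasks: [.tasks[].state]}'

# 4: does the topic exist, and is the high-water mark non-zero?
kafka-topics --bootstrap-server "$BROKER" --list | grep my.topic
kafka-run-class kafka.tools.GetOffsetShell --broker-list "$BROKER" --topic my.topic

# 5: read one message with a throwaway group id, then eyeball the JSON
kafka-console-consumer --bootstrap-server "$BROKER" --topic my.topic \
  --group "debug-$(date +%s)" --from-beginning --max-messages 1 | jq .
```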
For Atlas, I created a fresh test database, applied an ad-hoc KafkaConnector CR with snapshot.mode: never and capture.mode: change_streams_update_full (so I didn’t need to enable pre/post images on the collection), inserted one document, and watched the event come back through the topic with a full op:c payload and the correct source.db and source.collection fields.
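The ad-hoc CR was along these lines; the topic prefix and database name are illustrative, and the connection string resolves through the secrets provider exactly as above:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: atlas-smoke-test
  labels:
    strimzi.io/cluster: mongo-source-connect
spec:
  class: io.debezium.connector.mongodb.MongoDbConnector
  tasksMax: 1
  config:
    topic.prefix: smoke-test
    mongodb.connection.string: "mongodb+srv://${secrets:...:user}:${secrets:...:password}@${secrets:...:hostname}"
    database.include.list: cdc_smoke_test        # the fresh test database
    snapshot.mode: never
    capture.mode: change_streams_update_full     # no pre/post-image requirement
```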
For Postgres, both publications were puballtables=t, which meant any new table I created would be auto-captured. I created public.cdc_test_<ts>, inserted one row, and read the event back from <prefix>.public.cdc_test_<ts> — full payload, with LSN and ts_ms.
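The Postgres side of the same test, with a made-up timestamp suffix:

```sql
-- With puballtables=t the new table is captured with no publication change
CREATE TABLE public.cdc_test_1718000000 (id serial PRIMARY KEY, note text);
INSERT INTO public.cdc_test_1718000000 (note) VALUES ('end-to-end probe');
-- then read <prefix>.public.cdc_test_1718000000 and check op:"c", source.lsn, ts_ms
```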
Both pipelines: green. Both sources were producing real production traffic too — one CDC topic had already accumulated 899 messages from a snapshot, so this wasn't just my test row.

Where the pipeline actually broke
I checked the S3 sink connectors. They were RUNNING. Their topic regexes targeted prefixes like ^datalakeV3.mongodb.* and ^datalake.api_service_change_log.public.*. So I listed every topic on the cluster.
Zero topics matched.
The three S3 sink consumer groups had no assigned partitions. The two pre-provisioned data-lake buckets — created by IaC as part of the regional build-out — were completely empty.
The missing piece wasn’t the connectors or the sinks. The missing piece was the transformer — a small streaming app that consumes raw CDC topics (mongodb_source_connector.*, api_service_change_log.public.*) and re-publishes them under the datalake.* and datalakeV3.* naming convention the sinks expected. That service was running in the older regions and never got included in the new region’s rollout.
The gap was invisible at every layer’s local dashboard. Strimzi: green. Debezium: green. Kafka topics: green. S3 sinks: green. IaC apply: green. But the topics the sinks were waiting for didn’t exist, because the producer of those topics didn’t exist.
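The check that made the gap visible was one line, worth keeping in every runbook (broker address illustrative):

```bash
# If this prints nothing, the sinks are subscribed to topics nobody produces.
kafka-topics --bootstrap-server "$BROKER" --list \
  | grep -E '^datalakeV3\.mongodb\.|^datalake\.api_service_change_log\.public\.' \
  || echo "no topics match the sink regexes"
```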

Lessons I’m taking forward
Verify the seam, not the endpoint. “Component X reports healthy” doesn’t tell you whether the message actually crossed into component Y’s input. The only honest test is to insert a row at one end and watch it arrive at the other. Everything else is a unit test of one stage.
Topic naming is interface design. Having mongodb_source_connector.foo on one side and the sinks expecting datalakeV3.mongodb.foo on the other isn’t a bug. It’s two teams that didn’t share a naming contract. We’re standardizing the prefix scheme as a typed value in the platform chart, so the input regex of one stage is the output prefix of the previous stage by construction, not by convention.
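The shape we are converging on is roughly this (values structure hypothetical): one typed value from which both the transformer's output config and the sink's subscription are rendered.

```yaml
# Shared platform values (hypothetical structure)
topicContract:
  mongoPrefix: datalakeV3.mongodb

# transformer chart renders its producer config from it:
#   output.topic.prefix: {{ .Values.topicContract.mongoPrefix }}
# sink chart renders its subscription from the same value:
#   topics.regex: '^{{ .Values.topicContract.mongoPrefix }}\..*'
```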
Default capture modes have hidden requirements. The Mongo chart defaults to capture.mode: change_streams_update_full_with_pre_image, which silently requires changeStreamPreAndPostImages: { enabled: true } on every captured collection. The connector won't refuse to start — it will start, and then events will be missing fields. Make this explicit in the chart values, with a per-collection override path.
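Enabling pre/post images is a per-collection admin command (MongoDB 6.0+), which is exactly why it belongs in scripted onboarding rather than someone's memory; collection name hypothetical:

```javascript
// mongosh: turn on pre/post images for one captured collection
db.runCommand({
  collMod: "orders",                                // hypothetical collection
  changeStreamPreAndPostImages: { enabled: true }   // required by *_with_pre_image capture modes
})
```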
Postgres FOR ALL TABLES is a trap or a gift. Auto-capture is wonderful in test environments and dangerous in production. A developer creates a new table without thinking about CDC, events start flowing, a downstream consumer breaks because the schema isn’t registered. Either commit to FOR ALL TABLES with strict downstream tooling, or use explicit publications. The middle path is the worst path.
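For reference, the explicit-publication path makes table onboarding a deliberate, reviewable statement (table names hypothetical; dbz_publication is Debezium's default publication name):

```sql
-- Explicit publication: new tables are invisible to CDC until added
CREATE PUBLICATION dbz_publication FOR TABLE public.orders, public.customers;
-- onboarding a table later is an intentional, reviewable step
ALTER PUBLICATION dbz_publication ADD TABLE public.invoices;
```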
Replication slots need a leash. Set max_slot_wal_keep_size on the Postgres side, alert on slot lag, and put a runbook on the alert. The cost of leaving this default is one bad weekend per year.
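On PostgreSQL 13+ (on RDS the parameter is set through the parameter group), a capped slot also self-reports how close it is to invalidation, which makes a clean alert signal:

```sql
-- With max_slot_wal_keep_size set, wal_status moves through
-- reserved -> extended -> unreserved -> lost as a slot falls behind.
SELECT slot_name, wal_status, pg_size_pretty(safe_wal_size) AS headroom
FROM pg_replication_slots;
```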
Where I’d push the platform next

The chart deploys the Kafka Connect cluster and the Debezium connectors. It does not deploy the IAM role that the S3 sink assumes, the S3 bucket it writes to, or the bucket policy. Those still live in our IaC repo, behind a separate review and a separate apply.
That seam — Helm chart for the runtime, IaC for the AWS plumbing — is exactly where the rollout to a new region drops a piece on the floor, as we saw with the missing transformer. The natural next move is to push the AWS resources into the same chart, via Crossplane:
- A `Composition` that creates the S3 bucket, the IAM role with the right trust policy, and the policy attachment for the IRSA service account that the sink pod uses.
- A `Composite Resource Claim` rendered by the chart and referenced by the connector spec (a sketch follows the list).
- One PR, one chart, one region — no IaC drift, no naming mismatch between layers.
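The claim the chart would render might look something like this; the API group, kind, and fields all belong to an XRD we would have to define, so treat it purely as a sketch of the shape:

```yaml
apiVersion: datalake.example.org/v1alpha1   # hypothetical XRD group
kind: SinkBucketClaim                       # hypothetical claim kind
metadata:
  name: datalake-v3-eu-central-1
spec:
  region: eu-central-1
  serviceAccount: s3-sink-connect-connect   # IRSA SA the sink worker runs as (illustrative)
  compositionSelector:
    matchLabels:
      purpose: datalake-sink
```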
I’m writing that up as a follow-on post: “From Pulumi to Crossplane: shipping IAM and S3 buckets inside your Helm chart”. The TL;DR is simple: if the chart that ships the connector also ships the bucket and the role, you can’t deploy half the pipeline.
What to watch in production
A short, opinionated checklist:
- Postgres slot lag — alert when `confirmed_flush_lsn` falls more than N WAL segments behind `pg_current_wal_lsn()`. Page on it.
- Restart counts on connector tasks — a task that flaps every few minutes is succeeding from Strimzi's point of view and failing from the data's point of view.
- Sink consumer-group assignment — if a sink's group has zero assigned partitions for more than a few minutes, the regex doesn't match what's actually being produced (a one-liner for this follows the list).
- Pre-image enablement on Mongo collections you care about — script this as part of the onboarding for a new captured database; don't leave it to chance.
- Topic creation policy — `topic.creation.enable: true` is convenient for greenfield, terrifying when an operator typo creates `customer-event` and `customer_events` side by side.
- Schema-change events — Debezium emits them; your downstream had better consume them.
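Most of these reduce to one-liners. The consumer-group check, for instance (the group name follows Kafka Connect's connect-<connector> convention; broker address and connector name illustrative):

```bash
# Zero rows with assigned partitions = the sink is subscribed to nothing real.
kafka-consumer-groups --bootstrap-server "$BROKER" \
  --describe --group connect-s3-sink-datalake-v3   # hypothetical connector name
```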
Conclusion
Strimzi + Debezium is one of the most production-worthy open-source platforms I’ve put into a customer’s stack in the last five years. The operator is conservative, the resources are debuggable, and the connector ecosystem is broad enough that we run Mongo, Postgres, and S3 sink on the same chassis with the same idioms.
But “the platform works” and “the pipeline works” are different statements. The platform shipped green. The pipeline only worked when we walked it end to end and forced the gap to become visible. If you only take one thing from this post: build the end-to-end test before you ship the platform. The test is what tells you the platform is finished.