
CDC Technology Landscape: Practical Options for On-Prem MVP



Audience: Antonio — systems architect evaluating CDC technologies for a Peru/LATAM enterprise overlay.

Scope: MySQL first (MVP), then SQL Server, later Postgres and Oracle. Focus on open-source or built-in features that can be deployed on-prem with minimal friction.

---

Introduction



Change Data Capture (CDC) is the foundation of our overlay: we need to observe row-level mutations in a legacy database without touching application code. For an on-prem MVP, the ideal CDC technology is easy to install, has low performance overhead, provides reliable event delivery, and integrates with our processing pipeline. This report surveys practical options for MySQL, SQL Server, PostgreSQL, and Oracle, with an eye toward operational simplicity and cost.

Key criteria:

- Ease of installation and operation on-prem (few moving parts)
- Performance overhead on the source database
- Reliability of delivery: ordering, at-least-once semantics, recovery after failure
- Completeness of events: before/after row images, transaction boundaries, schema changes
- Privilege footprint required on the source (IT security sign-off)
- Cost: open-source or built-in features only

---


MySQL CDC Options



MySQL CDC relies on reading the binary log (binlog), which records all row changes. Two major open-source approaches dominate: Debezium and Maxwell.

Debezium MySQL Connector



Debezium is a distributed CDC platform built on top of Kafka Connect. It reads the binlog and streams change events to Kafka topics (or to other sinks via Debezium Server). It is the most feature-complete and widely used option.

Pros:

- Most feature-complete and widely used; battle-tested in production.
- Careful handling of edge cases: transaction boundaries, schema changes, and initial snapshots (including incremental snapshots).
- Built-in retry with backoff and dead-letter-queue support.
- The same platform later covers SQL Server, PostgreSQL, and Oracle.

Cons:

- Requires Kafka (or Kafka Connect) as a backbone; adds moving parts (ZooKeeper, brokers). This can be heavy for an MVP, but we could use Debezium Server (HTTP sink) or embedded mode to avoid Kafka.


Privileges: Per the Debezium docs, the connector user needs `REPLICATION SLAVE` (to read the binlog), `REPLICATION CLIENT` (to query binlog positions), and `SELECT` on the captured tables for the initial snapshot; `RELOAD` and `SHOW DATABASES` are also listed for snapshotting. Before/after images come from the binlog itself (with `binlog_row_image=FULL`), not from `SELECT`; the `SELECT` grant can be scoped to just the filtered tables.


Performance: Generally low overhead; the binlog is already being written for replication, and Debezium connects like an ordinary replica (via the mysql-binlog-connector-java client). However, high-velocity tables may produce many events, so network bandwidth between MySQL and Debezium/Kafka must be considered. Recommend pointing CDC at a read replica to avoid impacting the primary.

Failure recovery: Connector offsets are stored in Kafka (or in a local file when running Debezium Server or the embedded engine). On restart, it resumes from the last offset. Incremental snapshots are available if a re-sync is needed. Built-in retry with exponential backoff and dead letter queue support.

MVP viability: High, if you accept Kafka dependency. For a quick demo, you could use Debezium with a single-node Kafka (or even Debezium Server to ship events directly to your service over HTTP). The architecture is sound and battle-tested.
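As an illustration, a minimal Debezium Server configuration for this Kafka-free shape might look like the sketch below. Property names follow recent Debezium documentation but should be verified against the deployed version; hostnames, table names, and file paths are placeholders.

```properties
# Debezium Server: MySQL source -> HTTP sink, no Kafka.
# Verify property names against the deployed Debezium version.
debezium.sink.type=http
# Our overlay service's ingest endpoint (hypothetical URL).
debezium.sink.http.url=http://overlay-service:8080/cdc/events

debezium.source.connector.class=io.debezium.connector.mysql.MySqlConnector
# Assumption: CDC attached to a read replica, not the primary.
debezium.source.database.hostname=mysql-replica.internal
debezium.source.database.port=3306
debezium.source.database.user=cdc_user
debezium.source.database.password=${CDC_PASSWORD}
# Must be unique among "replicas" of this MySQL server.
debezium.source.database.server.id=184054
debezium.source.topic.prefix=overlay
# Hypothetical tables to capture.
debezium.source.table.include.list=erp.invoices,erp.customers

# File-based offsets and schema history keep the footprint Kafka-free.
debezium.source.offset.storage.file.filename=/var/lib/debezium/offsets.dat
debezium.source.offset.flush.interval.ms=10000
debezium.source.schema.history.internal=io.debezium.storage.file.history.FileSchemaHistory
debezium.source.schema.history.internal.file.filename=/var/lib/debezium/schema-history.dat
```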

Maxwell Daemon



Maxwell is a MySQL-specific CDC daemon, written in Java, that reads the binlog and emits flat JSON events to Kafka, Kinesis, RabbitMQ, Redis, stdout, or custom sinks (including HTTP) via its producer API. It is simpler than Debezium in some respects.

Pros:

- Single lightweight Java process; minimal configuration.
- Simple, flat JSON output that is easy to consume.
- Can run without Kafka by choosing a non-Kafka producer.

Cons:

- MySQL-only; no path to SQL Server, PostgreSQL, or Oracle.
- Smaller ecosystem and less thorough edge-case handling than Debezium.
- Update events carry only the changed columns in the `old` field, so full before-images are not guaranteed.

Privileges: Needs `REPLICATION SLAVE`, `REPLICATION CLIENT`, and `SELECT` on tables. It also stores its own state (binlog positions, schema history) in a `maxwell` database on the server it connects to, which requires write access to that schema (or you can precreate it). That may be a showstopper if IT security demands no writes to production; using a read replica avoids this.


Performance: Similar to Debezium; lightweight. It reads the binlog like any replica.

Failure recovery: Offsets are stored in the `maxwell` state database (the `positions` table); on restart it resumes from the last recorded position. If that state is lost, you may need to re-bootstrap (re-snapshot).

MVP viability: Very good for a quick prototype if you can tolerate a small schema for offsets on the replica. Even easier if you use HTTP mode and avoid Kafka.
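For reference, Maxwell's JSON envelope is straightforward to consume. A minimal Go decoder might look like the following sketch; field names follow Maxwell's documented output format, and note that `old` carries only the columns whose values changed.

```go
// Sketch: decoding Maxwell's JSON envelope in Go. Field names follow
// Maxwell's documented output; "old" holds only changed columns' prior
// values, so full before-images must come from elsewhere if required.
package main

import (
	"encoding/json"
	"fmt"
)

type MaxwellEvent struct {
	Database string                 `json:"database"`
	Table    string                 `json:"table"`
	Type     string                 `json:"type"` // "insert", "update", "delete"
	TS       int64                  `json:"ts"`   // commit timestamp (seconds)
	XID      int64                  `json:"xid"`  // transaction id
	Commit   bool                   `json:"commit"`
	Data     map[string]interface{} `json:"data"` // row values after the change
	Old      map[string]interface{} `json:"old"`  // prior values of changed columns (updates only)
}

func main() {
	raw := []byte(`{"database":"erp","table":"invoices","type":"update",
		"ts":1700000000,"xid":940752,"commit":true,
		"data":{"id":7,"status":"PAID"},"old":{"status":"OPEN"}}`)

	var ev MaxwellEvent
	if err := json.Unmarshal(raw, &ev); err != nil {
		panic(err)
	}
	fmt.Printf("%s.%s %s: %v (was %v)\n", ev.Database, ev.Table, ev.Type, ev.Data, ev.Old)
}
```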

Other MySQL OSS Options



Other binlog readers exist (e.g., Alibaba's Canal, or low-level libraries such as go-mysql), but for an MVP, Debezium and Maxwell are the only realistic contenders.


Recommendation (MySQL MVP): Use Debezium with the embedded engine or a direct HTTP sink to avoid Kafka complexity, or Maxwell with HTTP output if you want a single Java process pushing events directly to the overlay service. Both are workable; Debezium has the larger ecosystem and more careful handling of edge cases (transaction boundaries, schema changes). For a 60-day MVP, I'd lean toward Debezium Server with the HTTP sink, or embedded mode if we write a small Java/Kotlin wrapper. Note that if we want to stay in Go, there is no mature Go CDC connector for MySQL; we'd have to rely on an external process.

---

SQL Server CDC Options



SQL Server has built-in CDC (since SQL Server 2008) that captures changes into system change tables. Debezium also offers a SQL Server connector, but it relies on that same native CDC feature: it reads the change tables rather than the transaction log directly (it does not use `fn_dblog`). There is also SQL Server Change Tracking, which is lighter weight but captures less information.

Built-in SQL Server CDC



SQL Server CDC is a feature that must be enabled per database and per table. It creates change tables that mirror the source table's columns plus metadata (`__$operation`, `__$start_lsn`, etc.). Applications can query these change tables to get inserts/updates/deletes.
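For concreteness, a sketch of enabling CDC from Go is shown below. The stored procedures (`sys.sp_cdc_enable_db`, `sys.sp_cdc_enable_table`) are SQL Server's documented ones; the driver choice, connection string, and table names are assumptions.

```go
// Sketch: enabling native CDC on a database and one table from Go.
// Connection string, schema, and table name are placeholders.
package main

import (
	"database/sql"
	"log"

	_ "github.com/microsoft/go-mssqldb" // driver choice is an assumption
)

func main() {
	db, err := sql.Open("sqlserver", "sqlserver://cdc_admin:secret@mssql-host?database=erp")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Requires sysadmin; run once per database.
	if _, err := db.Exec("EXEC sys.sp_cdc_enable_db"); err != nil {
		log.Fatal(err)
	}
	// Requires db_owner; creates the cdc.dbo_invoices capture instance
	// and its change table.
	if _, err := db.Exec(`EXEC sys.sp_cdc_enable_table
		@source_schema = N'dbo',
		@source_name   = N'invoices',
		@role_name     = NULL`); err != nil {
		log.Fatal(err)
	}
}
```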

Pros:

- Built into the engine; no extra software to install.
- Change tables are transactionally consistent and indexed; queryable with plain T-SQL.
- Update rows carry both before and after images (`__$operation` values 3 and 4).

Cons:

- Must be enabled per database and per table (requires `sysadmin` / `db_owner`), and the capture/cleanup jobs depend on SQL Server Agent.
- Edition-restricted: Enterprise only before SQL Server 2016 SP1, Standard or higher since.
- Polling-based consumption adds latency, and change tables are pruned by a retention-based cleanup job (default roughly 3 days), so a stalled poller can lose changes.

Offset management: Your polling process must track LSNs (Log Sequence Numbers) per table and store its own offsets (e.g., in a control table).


MVP approach: Build a simple poller (Go/Python) that queries `cdc.fn_cdc_get_all_changes_...` functions for each enabled table, stores the max LSN, and emits events. This is straightforward and avoids extra components.
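A sketch of that poller's core loop follows, assuming a hypothetical `dbo_invoices` capture instance and the go-mssqldb driver; offset persistence to a control table is stubbed out.

```go
// Sketch: minimal native-CDC poller. Assumes CDC is already enabled.
// Production code must persist the LSN to a control table and handle
// the empty-range case (from > max) that fn_cdc_get_all_changes rejects.
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/microsoft/go-mssqldb" // driver choice is an assumption
)

func pollOnce(db *sql.DB, fromLSN []byte) ([]byte, error) {
	var toLSN []byte // upper bound for this batch
	if err := db.QueryRow("SELECT sys.fn_cdc_get_max_lsn()").Scan(&toLSN); err != nil {
		return nil, err
	}
	rows, err := db.Query(
		`SELECT __$operation, __$start_lsn, id, status
		   FROM cdc.fn_cdc_get_all_changes_dbo_invoices(@from, @to, N'all')`,
		sql.Named("from", fromLSN), sql.Named("to", toLSN))
	if err != nil {
		return nil, err
	}
	defer rows.Close()
	for rows.Next() {
		var op int
		var lsn []byte
		var id int64
		var status string
		if err := rows.Scan(&op, &lsn, &id, &status); err != nil {
			return nil, err
		}
		// __$operation: 1=delete, 2=insert, 3=update (before), 4=update (after).
		log.Printf("op=%d id=%d status=%s", op, id, status)
	}
	return toLSN, rows.Err()
}

func main() {
	db, err := sql.Open("sqlserver", "sqlserver://cdc_reader:secret@mssql-host?database=erp")
	if err != nil {
		log.Fatal(err)
	}
	// First run: start from the capture instance's minimum LSN.
	var from []byte
	if err := db.QueryRow("SELECT sys.fn_cdc_get_min_lsn('dbo_invoices')").Scan(&from); err != nil {
		log.Fatal(err)
	}
	for {
		to, err := pollOnce(db, from)
		if err != nil {
			log.Print(err)
		} else {
			// Advance past the batch boundary so it is not re-read.
			if err := db.QueryRow("SELECT sys.fn_cdc_increment_lsn(@p1)", to).Scan(&from); err != nil {
				log.Print(err)
			}
		}
		time.Sleep(5 * time.Second)
	}
}
```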

Debezium SQL Server Connector



Debezium also supports SQL Server via log-based reading using the SQL Server CDC feature (it actually relies on the same change tables). It does not read the transaction log directly; it uses the CDC capture instances. It then streams events to Kafka.

Pros:

- Same tooling and event envelope as the MySQL connector; one ingestion model across databases.
- Offset management, snapshots, and retries handled by the platform.

Cons:

- Inherits the Kafka (or Debezium Server) dependency.
- Still bounded by native CDC's latency and change-table retention, since it reads the same capture instances.

Privileges: `SELECT` on the captured tables and on the `cdc` schema change tables; enabling CDC itself requires `sysadmin` (and `db_owner` per table).


MVP consideration: If we choose Debezium for MySQL, using it for SQL Server as well gives consistency. But that locks us into Kafka. For a lightweight MVP, the built-in CDC with a custom poller may be simpler and avoid Kafka entirely.

Change Tracking



SQL Server Change Tracking is a lighter alternative: it only tells you that a row changed, not what the values were; you then query the current table for the new values. That is insufficient for our audit overlay, which needs before/after images.

Recommendation (SQL Server MVP): Use native SQL Server CDC with a custom lightweight poller written in Go (or Python). This avoids Kafka, keeps dependencies low, and can be packaged as a single binary alongside the rest of the overlay. If MySQL uses Debezium, we'll have two ingestion methods; that's okay if we normalize to a common interface.

---

PostgreSQL (Later Phase)



PostgreSQL offers logical decoding, which exposes row-level changes through an output plugin over the streaming replication protocol. Options:

Debezium PostgreSQL Connector: Reads via logical replication and can use the built-in pgoutput plugin, so nothing extra needs to be installed on the server. It requires a replication slot. Robust, and similar in operation to the MySQL connector.

Postgres native logical replication: Could use the pglogical extension or the built-in `CREATE PUBLICATION` / `CREATE SUBSCRIPTION`. But that replicates into another PostgreSQL database; to consume changes in our own service we would have to speak the replication protocol ourselves. Not a ready-made CDC push.

MVP-friendly option: Use the wal2json output plugin with a small Go program that connects over the replication protocol (`pg_recvlogical` style) and parses the JSON messages. Simpler than Debezium and no Kafka. Two prerequisites: wal2json must be installed as a shared library on the database host (an OS-level step, typically needing admin access), and a logical replication slot must be created, which requires a role with the `REPLICATION` attribute or superuser. The install step could be a hurdle with a conservative DBA team.
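One possible shape for that Go consumer, using the community jackc/pglogrepl and pgconn libraries (a library choice we have not committed to, and whose APIs shift between versions; slot name and DSN are placeholders):

```go
// Sketch: stream wal2json output over the replication protocol.
// Error handling and keepalive acknowledgement are compressed.
package main

import (
	"context"
	"log"

	"github.com/jackc/pgconn"
	"github.com/jackc/pglogrepl"
	"github.com/jackc/pgproto3/v2"
)

func main() {
	ctx := context.Background()
	// "replication=database" switches the connection into replication mode.
	conn, err := pgconn.Connect(ctx, "postgres://cdc:secret@pg-host/erp?replication=database")
	if err != nil {
		log.Fatal(err)
	}
	// One-time setup; requires the REPLICATION attribute (or superuser).
	_, _ = pglogrepl.CreateReplicationSlot(ctx, conn, "overlay_slot", "wal2json",
		pglogrepl.CreateReplicationSlotOptions{})

	if err := pglogrepl.StartReplication(ctx, conn, "overlay_slot", 0,
		pglogrepl.StartReplicationOptions{}); err != nil {
		log.Fatal(err)
	}
	for {
		msg, err := conn.ReceiveMessage(ctx)
		if err != nil {
			log.Fatal(err)
		}
		cd, ok := msg.(*pgproto3.CopyData)
		if !ok || cd.Data[0] != pglogrepl.XLogDataByteID {
			continue // keepalives would be acknowledged in real code
		}
		xld, err := pglogrepl.ParseXLogData(cd.Data[1:])
		if err != nil {
			log.Fatal(err)
		}
		log.Printf("wal2json: %s", xld.WALData) // JSON change document
	}
}
```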

Privileges: The connecting role needs the `REPLICATION` attribute (or superuser) to create a slot and stream logical changes. Logical decoding sees all changes regardless of per-table `SELECT` grants; with pgoutput you additionally define publications (which tables to include), and creating those requires table ownership or appropriate rights.

Performance: Logical decoding is efficient; built into Postgres. Overhead similar to normal replication.

Recommendation: For a later phase, use wal2json with a custom consumer (or Debezium if already using Kafka). For MVP, we can postpone.

---

Oracle (Optional, Expensive)



Oracle provides GoldenGate — a mature, high-performance CDC product, but it is commercial and expensive. It captures changes from redo logs and streams to many targets.

There is also the Debezium Oracle Connector, which reads changes via LogMiner (the default, needing no license beyond the database itself) or via Oracle's XStream API (which requires an Oracle GoldenGate license and its client libraries). The Oracle connector is less widely deployed than Debezium's other connectors, and Oracle licensing is tricky.

Reality: Most mid-size enterprises in LATAM using Oracle likely have GoldenGate already for replication. Partnering with Oracle CDC would mean integrating with GoldenGate’s APIs or consuming its output (e.g., to Kafka). That adds cost and complexity.

MVP stance: Mark Oracle as out of scope for the initial MVP. Document that if needed later, either:

- integrate with an existing GoldenGate deployment and consume its output (e.g., streamed to Kafka), or
- deploy the Debezium Oracle connector (LogMiner or XStream) where licensing allows.

Given the target market (mid-size, limited budgets), Oracle is less common; we can focus on MySQL and then SQL Server.


---

Comparative Summary & MVP Recommendation



| DBMS | Recommended MVP Approach | Requires Kafka? | Key Privileges | Maturity | On‑prem Fit |
|-------------|----------------------------------------------|-----------------|--------------------------------------|----------|-------------|
| MySQL       | Debezium Server (HTTP sink) or Maxwell HTTP   | No (direct)     | REPLICATION SLAVE/CLIENT, SELECT on tables | High | Excellent |
| SQL Server | Native CDC + custom Go poller | No | db_owner (to enable CDC), SELECT on change tables | High | Very Good |
| PostgreSQL | wal2json + Go replication consumer | No | REPLICATION role (or superuser) | Medium | Good |
| Oracle | Defer; later use GoldenGate or Debezium XStream | Maybe (Debezium) | GoldenGate privileges / XStream | Medium | Poor (cost) |


Why this mix for MVP? We want minimal dependencies. Kafka is a heavy dependency for a 60‑day pilot. Therefore, for MySQL we choose a direct HTTP sink (Maxwell HTTP mode or Debezium Server) that pushes events to our overlay service over a simple REST endpoint. That keeps the stack to: MySQL → Debezium/Maxwell (Java process) → our Go service (HTTP server). For SQL Server, we can poll CDC tables; no need for Kafka.
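To make the receiving side concrete, here is a minimal sketch of the overlay's ingest endpoint. The envelope fields (`before`, `after`, `op`, `source`) follow Debezium's documented event format; the `/cdc/events` path and port are our own choices, and depending on converter settings the payload may arrive unwrapped.

```go
// Sketch: the overlay's HTTP ingest endpoint for Debezium-style events.
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

type DebeziumEvent struct {
	Payload struct {
		Before map[string]interface{} `json:"before"`
		After  map[string]interface{} `json:"after"`
		Op     string                 `json:"op"` // c=create, u=update, d=delete, r=snapshot read
		Source struct {
			Table string `json:"table"`
			TsMs  int64  `json:"ts_ms"`
		} `json:"source"`
	} `json:"payload"`
}

func main() {
	http.HandleFunc("/cdc/events", func(w http.ResponseWriter, r *http.Request) {
		var ev DebeziumEvent
		if err := json.NewDecoder(r.Body).Decode(&ev); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		log.Printf("%s %s: %v -> %v", ev.Payload.Op, ev.Payload.Source.Table,
			ev.Payload.Before, ev.Payload.After)
		// A 2xx response acknowledges the event to the sender.
		w.WriteHeader(http.StatusAccepted)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```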

If later we decide Kafka is acceptable, we could standardize on Debezium across all databases and use Kafka Connect as the ingestion bus.

---

Operational Considerations



Required Privileges

Summarized from the sections above:

- MySQL: `REPLICATION SLAVE`, `REPLICATION CLIENT`, and `SELECT` on captured tables (plus `RELOAD`/`SHOW DATABASES` for Debezium snapshots). Maxwell additionally needs write access to its own `maxwell` state schema.
- SQL Server: `sysadmin` to enable CDC on the database and `db_owner` per table; the poller itself only needs `SELECT` on the `cdc` schema.
- PostgreSQL: a role with the `REPLICATION` attribute (or superuser) for the slot; rights to create publications if using pgoutput.
- Oracle: deferred; GoldenGate / XStream privileges would apply.

Performance Impact



CDC reads the transaction log/replication stream. On a primary, this adds a small read load. Best practice: attach CDC to a read replica if possible. For MySQL, point Debezium/Maxwell at the replica's binlog; note that a replica only writes applied changes into its own binlog if `log_slave_updates` is enabled (renamed `log_replica_updates` in recent versions), and it must run with `binlog_format=ROW` (required) and `binlog_row_image=FULL` for complete before/after images. Performance tuning: adjust batch sizes and network buffers. Monitor replica lag.
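A minimal sketch of the relevant replica settings (variable names per MySQL 5.7/8.0; values are placeholders):

```ini
# Replica my.cnf fragment for CDC.
[mysqld]
server_id         = 42
log_bin           = mysql-bin
binlog_format     = ROW   # required by Debezium/Maxwell
binlog_row_image  = FULL  # full before/after images
log_slave_updates = ON    # replica re-logs applied changes into its own binlog
```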

For SQL Server, reading CDC tables adds load on the primary because CDC cleanup job runs there; but reading from a secondary replica is not straightforward (CDC tables are not replicated). So for SQL Server we may need to read from primary. That's acceptable if load is low; CDC tables are indexed and incremental queries are cheap.

Offset Management & Failure Recovery



We need effectively exactly-once processing of events downstream; in practice that means at-least-once delivery plus idempotent handling. The simplest approach: store the latest offset (LSN, binlog position, or transaction ID) in our overlay's own database after successfully processing a batch. On restart, read the last offset and resume. This is a common pattern.


If our overlay crashes after processing but before offset commit, we might reprocess events; we need idempotency in downstream storage (e.g., deduplicate by event ID). We'll design the `CanonicalChangeEvent` with a globally unique ID derived from source transaction ID + sequence.
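A sketch of that pattern, committing the offset in the same transaction as the deduplicated batch. Table names, the `CanonicalChangeEvent` fields, and the ID scheme are illustrative; the placeholder syntax assumes a Postgres-style driver for the overlay's own store.

```go
// Sketch: idempotent batch processing with a crash-safe resume point.
// Duplicates from redelivery hit the primary key and are skipped.
package ingest

import (
	"database/sql"
	"fmt"
)

type CanonicalChangeEvent struct {
	TxnID string // source transaction ID
	Seq   int    // sequence within the transaction
	Data  []byte // normalized event payload
}

// EventID is globally unique: redelivered events map to the same key.
func (e CanonicalChangeEvent) EventID() string {
	return fmt.Sprintf("%s:%d", e.TxnID, e.Seq)
}

func processBatch(db *sql.DB, events []CanonicalChangeEvent, offset string) error {
	tx, err := db.Begin()
	if err != nil {
		return err
	}
	defer tx.Rollback()
	for _, e := range events {
		if _, err := tx.Exec(
			`INSERT INTO change_events (event_id, payload) VALUES ($1, $2)
			 ON CONFLICT (event_id) DO NOTHING`, e.EventID(), e.Data); err != nil {
			return err
		}
	}
	// Offset commits atomically with the batch: a crash before this point
	// replays the batch, and the dedup key absorbs the duplicates.
	if _, err := tx.Exec(
		`UPDATE cdc_offsets SET position = $1 WHERE source = 'mysql-primary'`,
		offset); err != nil {
		return err
	}
	return tx.Commit()
}
```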


Backup: Periodic snapshots of the offset store.

---

Conclusion



For an on‑prem MVP targeting MySQL first, the recommended approach is Debezium Server with its HTTP sink (or Maxwell in HTTP mode) to avoid Kafka while retaining a robust binlog reader. This gives us structured JSON events with before/after images and transaction boundaries. For SQL Server later, a native CDC poller is simple and avoids extra infrastructure. PostgreSQL can use wal2json; Oracle defers.

The next step is to define the event normalization schema and semantic inference layer, which will consume these CDC streams regardless of source. That will be covered in Report 2.

---
