CDC Technology Landscape: Practical Options for On-Prem MVP
Audience: Antonio — systems architect evaluating CDC technologies for a Peru/LATAM enterprise overlay.
Scope: MySQL first (MVP), then SQL Server, later Postgres and Oracle. Focus on open-source or built-in features that can be deployed on-prem with minimal friction.
---
Introduction
Change Data Capture (CDC) is the foundation of our overlay: we need to observe row-level mutations in a legacy database without touching application code. For an on-prem MVP, the ideal CDC technology is easy to install, has low performance overhead, provides reliable event delivery, and integrates with our processing pipeline. This report surveys practical options for MySQL, SQL Server, PostgreSQL, and Oracle, with an eye toward operational simplicity and cost.
Key criteria:
- Ease of installation: Can we set it up in hours, not days?
- On-prem friendliness: No cloud dependencies, works behind firewalls.
- Maturity: Production-proven, active community.
- Operational risk: Impact on source DB; recovery mechanisms.
- Required privileges: What DB permissions are needed?
- Offset management: How does the connector track its position in the log?
- Failure recovery: What happens on crashes or network issues?
---
MySQL CDC Options
MySQL CDC relies on reading the binary log (binlog), which records all row changes. Two major open-source approaches dominate: Debezium and Maxwell.
Debezium MySQL Connector
Debezium is a distributed CDC platform built on top of Kafka Connect. It reads the binlog and streams change events to Kafka topics (or other sinks via Kafka Connect). It is the most feature-complete and widely used.
Pros:
- Extremely mature; used in large-scale production (Red Hat, Goldman, etc.).
- Captures full before/after images, operation type (insert/update/delete), and transaction boundaries.
- Handles schema changes gracefully (with DDL events).
- At-least-once delivery out of the box; exactly-once for source connectors is possible with newer Kafka Connect exactly-once support.
- Rich metadata (source timestamp, server ID, etc.).
- Connector ecosystem: can sink to many systems (Kafka, Pulsar, JDBC, HTTP).
Cons:
- Requires Kafka (or Kafka Connect) as a backbone; adds moving parts (ZooKeeper or KRaft, brokers). This can be heavy for an MVP, but we could use Debezium Server (HTTP sink) or the embedded engine to avoid Kafka.
- Installation: need Java, Kafka, and Debezium connector. Not a single binary.
- Offsets are stored in Kafka topics by default; Debezium Server and the embedded engine can keep them in a local file or via JDBC instead.
- Operational overhead: monitor Kafka cluster, ZK, connector tasks.
Privileges: Per the Debezium documentation, the connector user needs `SELECT`, `RELOAD`, `SHOW DATABASES`, `REPLICATION SLAVE`, and `REPLICATION CLIENT`. `REPLICATION SLAVE` grants access to the binlog stream; `SELECT` (which can be scoped to the captured tables) is needed for the initial snapshot — before images come from the binlog itself when `binlog_row_image=FULL`.
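As a sketch, the grants for a dedicated connector user (the user name, host mask, and password are illustrative placeholders):

```sql
-- Minimal grants for a Debezium MySQL connector user, per the Debezium docs.
CREATE USER 'debezium'@'%' IDENTIFIED BY 'change-me';
GRANT SELECT, RELOAD, SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT
    ON *.* TO 'debezium'@'%';
FLUSH PRIVILEGES;
```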
Performance: Generally low overhead; binlog is already written for replication. Debezium uses the binlog client library. However, high-velocity tables may produce many events; network bandwidth between MySQL and Debezium/Kafka must be considered. Recommend using a read replica for CDC to avoid impacting primary.
Failure recovery: Connector offsets stored in Kafka (or external). On restart, it resumes from last offset. Can use incremental snapshots if needed. Built-in retry with exponential backoff. Dead letter queue support.
MVP viability: High, if you accept Kafka dependency. For a quick demo, you could use Debezium with a single-node Kafka (or even Debezium Server to ship events directly to your service over HTTP). The architecture is sound and battle-tested.
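A minimal Debezium Server configuration for the HTTP-sink path might look like the following sketch. Property names follow recent Debezium Server releases and should be verified against the version we deploy; host names, the endpoint URL, and table names are placeholders.

```properties
# Debezium Server sketch: MySQL source -> HTTP sink, file-based offsets.
debezium.sink.type=http
debezium.sink.http.url=http://overlay-service:8080/events

debezium.source.connector.class=io.debezium.connector.mysql.MySqlConnector
debezium.source.database.hostname=mysql-replica.internal
debezium.source.database.port=3306
debezium.source.database.user=debezium
debezium.source.database.password=change-me
debezium.source.database.server.id=184054
debezium.source.topic.prefix=overlay
debezium.source.table.include.list=erp.orders,erp.customers
debezium.source.offset.storage.file.filename=data/offsets.dat
debezium.source.schema.history.internal=io.debezium.storage.file.history.FileSchemaHistory
debezium.source.schema.history.internal.file.filename=data/schema-history.dat
```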
Maxwell Daemon
Maxwell is a CDC tool specific to MySQL, written in Java, that reads the binlog and outputs JSON events to Kafka, Kinesis, or directly to a custom sink via HTTP. It is simpler than Debezium in some respects.
Pros:
- Simpler setup than Debezium: a single Java process, though it still needs a sink (Kafka, another supported producer, or stdout).
- Produces clean JSON events: new row values in `data` and, for updates, prior values of changed columns in `old`.
- Actively maintained, used in production at several companies.
- Can bypass Kafka entirely: Maxwell ships alternative producers (stdout, file, Redis, RabbitMQ, and a pluggable custom-producer API), so a single process can feed events straight into our overlay — attractive for an MVP.
- Supports "bootstrapping" via initial snapshot.
Cons:
- Smaller community than Debezium; fewer connectors.
- Feature set slightly less comprehensive: Maxwell can emit DDL events (with `output_ddl=true`), but they are less structured than Debezium's schema-history handling.
- Still requires Java runtime.
- Offset management: Maxwell stores its position and schema snapshots in its own `maxwell` database, by default on the source MySQL server — it writes back to the source, which can be a concern for strict read-only environments. It can instead keep that state on a separate MySQL server (`host` vs. `replication_host` settings), or run against a read replica.
Privileges: Needs `REPLICATION SLAVE`, `REPLICATION CLIENT`, and `SELECT` on the replicated tables, plus full privileges on its own `maxwell` state database (which it creates, or a DBA can pre-create). Writes to production may be a showstopper if IT security demands a read-only source; using a read replica (or a separate state server) avoids this.
Performance: Similar to Debezium; lightweight. It reads the binlog like any replica.
Failure recovery: Offsets are stored in the `maxwell` state database; on restart Maxwell resumes from the last recorded position. If that state is lost, you may need to re-bootstrap (re-snapshot).
MVP viability: Very good for a quick prototype if you can tolerate a small state schema on the replica. Even easier if you use a non-Kafka producer (stdout, Redis, RabbitMQ, or a custom producer) and skip Kafka entirely.
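Whichever producer Maxwell writes to, our overlay will receive events in Maxwell's JSON format. A minimal Go sketch of the decoding (field names follow Maxwell's documented output; the sample row is hypothetical):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// MaxwellEvent models the JSON object Maxwell emits for each row change.
// "old" is only present on updates and holds the prior values of the
// columns that changed.
type MaxwellEvent struct {
	Database string                 `json:"database"`
	Table    string                 `json:"table"`
	Type     string                 `json:"type"` // insert | update | delete
	TS       int64                  `json:"ts"`
	Xid      int64                  `json:"xid"`
	Data     map[string]interface{} `json:"data"`
	Old      map[string]interface{} `json:"old,omitempty"`
}

// parseMaxwellEvent decodes one Maxwell JSON line into a MaxwellEvent.
func parseMaxwellEvent(line []byte) (MaxwellEvent, error) {
	var ev MaxwellEvent
	err := json.Unmarshal(line, &ev)
	return ev, err
}

func main() {
	sample := []byte(`{"database":"erp","table":"orders","type":"update","ts":1700000000,"xid":4242,"data":{"id":7,"status":"shipped"},"old":{"status":"pending"}}`)
	ev, err := parseMaxwellEvent(sample)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s.%s %s old=%v\n", ev.Database, ev.Table, ev.Type, ev.Old)
}
```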
Other MySQL OSS Options
- MySQL Enterprise Data Integration (commercial) — not relevant.
- Canal (Alibaba) — another binlog-parsing tool, mature and widely used in China, but oriented toward MySQL-to-MySQL replication and cache invalidation; less commonly deployed in the West.
- Zongji (Node.js library) — a lower-level binlog client; you'd build much more yourself.
For an MVP, Debezium and Maxwell are the only realistic contenders.
Recommendation (MySQL MVP): Use Debezium with the embedded engine or a direct HTTP sink to avoid Kafka complexity, or Maxwell with a custom producer if you want a single Java process that pushes events directly to your overlay service. Both are fine; Debezium has a larger ecosystem and more careful handling of edge cases (transaction boundaries, schema changes). For a 60-day MVP, I'd lean toward Debezium Server (HTTP sink), or the embedded engine if we write a small Java/Kotlin wrapper. If we want to stay in Go, note there is no mature Go CDC connector for MySQL; we'd have to rely on an external process.
---
SQL Server CDC Options
SQL Server has built-in CDC (since SQL Server 2008) that captures changes from the transaction log into system change tables. Debezium's SQL Server connector builds on that same feature: it requires CDC to be enabled and reads the change tables (capture instances) rather than parsing the log directly. There is also SQL Server Change Tracking, which is lighter weight but records only that a row changed, not the values.
Built-in SQL Server CDC
SQL Server CDC is a feature that must be enabled per database and per table. It creates change tables that mirror the source table's columns plus metadata (`__$operation`, `__$start_lsn`, etc.). Applications can query these change tables to get inserts/updates/deletes.
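Enabling it is a one-time DBA action; as a sketch (database and table names are hypothetical):

```sql
-- One-time setup; ErpDb and dbo.orders are hypothetical names.
USE ErpDb;
EXEC sys.sp_cdc_enable_db;                 -- requires sysadmin
EXEC sys.sp_cdc_enable_table               -- requires db_owner
     @source_schema = N'dbo',
     @source_name   = N'orders',
     @role_name     = NULL;                -- NULL: no gating role on the change data
```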
Pros:
- No external software; native to SQL Server. Available in Enterprise and Developer editions, and in Standard edition since SQL Server 2016 SP1.
- Integrates with SQL Server Agent for cleanup.
- Queryable via T-SQL; easy to integrate with existing tools.
- Modest runtime privileges: enabling CDC requires `sysadmin` (database level) and `db_owner` (table level) once; afterward the reader only needs `SELECT` on the change tables.
- Performance: CDC uses the transaction log; overhead is similar to replication but generally acceptable.
Cons:
- Requires enabling CDC per table, which creates capture instances and system change tables. This is a metadata-level change rather than a change to application tables, and since CDC is a built-in feature DBAs usually permit it — confirm if a strict zero-change policy is in force.
- Data is stored in internal change tables; you need a process to poll these tables and extract new changes. No push mechanism out of the box (but can use query notifications or custom polling).
- Not distributed; tied to that SQL Server instance.
- Before/after data is available but shaped oddly: an UPDATE appears as two rows in the change table (before image, `__$operation = 3`; after image, `__$operation = 4`), while DELETE rows carry the deleted values and INSERT rows the new values.
- Cleanup job must be tuned to avoid growth.
Offset management: Your polling process must track LSNs (Log Sequence Numbers) per table. You need to store your own offsets (e.g., in a control table).
MVP approach: Build a simple poller (Go/Python) that queries `cdc.fn_cdc_get_all_changes_...` functions for each enabled table, stores the max LSN, and emits events. This is straightforward and avoids extra components.
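The core polling query can be sketched in T-SQL (the capture instance `dbo_orders` is hypothetical; on resume, advance past the last stored LSN with `sys.fn_cdc_increment_lsn` before passing it as the lower bound):

```sql
-- Poll one capture instance; dbo_orders is a hypothetical name.
DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn('dbo_orders');
DECLARE @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();
SELECT __$start_lsn, __$operation, *
FROM cdc.fn_cdc_get_all_changes_dbo_orders(@from_lsn, @to_lsn, N'all');
```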
Debezium SQL Server Connector
Debezium also supports SQL Server via log-based reading using the SQL Server CDC feature (it actually relies on the same change tables). It does not read the transaction log directly; it uses the CDC capture instances. It then streams events to Kafka.
Pros:
- Same Debezium model (Kafka topics, schemas, at-least-once delivery with offset tracking).
- Unified architecture if you already use Debezium for MySQL.
- Handles schema changes well.
Cons:
- Still requires Kafka.
- Still requires SQL Server CDC to be enabled (same requirements as built-in).
- More complex than a simple poller.
Privileges: The connector user needs `SELECT` on the captured tables and on the CDC change tables, plus membership in any gating role specified when CDC was enabled.
MVP consideration: If we choose Debezium for MySQL, using it for SQL Server as well gives consistency. But that locks us into Kafka. For a lightweight MVP, the built-in CDC with a custom poller may be simpler and avoid Kafka entirely.
Change Tracking
SQL Server Change Tracking is a lighter alternative: it only tells you that something changed, not the actual data. You then query the current table to see new values. Might be insufficient for our audit overlay because we need before/after images.
Recommendation (SQL Server MVP): Use native SQL Server CDC with a custom lightweight poller written in Go (or Python). This avoids Kafka, keeps dependencies low, and can be packaged as a single binary alongside the rest of the overlay. If MySQL uses Debezium, we'll have two ingestion methods; that's okay if we normalize to a common interface.
---
PostgreSQL (Later Phase)
PostgreSQL offers logical decoding, exposing row-level changes via a streaming plugin. Options:
- wal2json: Popular output plugin that formats WAL changes into JSON.
- pgoutput: Built-in logical replication protocol (used by Debezium and others).
- decoderbufs: Another output plugin.
Debezium PostgreSQL Connector: Reads via logical replication, uses pgoutput by default. It requires a replication slot. It is robust and similar to MySQL Debezium.
Postgres native logical replication: could use the `pglogical` extension or the built-in `CREATE PUBLICATION`/`CREATE SUBSCRIPTION`. But subscriptions replicate into another PostgreSQL database; to consume changes in our own service we would have to speak the replication protocol ourselves via libpq. Not a ready-made CDC push.
MVP-friendly option: Use the wal2json output plugin with a small Go program that connects over the streaming replication protocol (as `pg_recvlogical` does) and parses the JSON messages. Simpler than Debezium and no Kafka. Note the hurdles: wal2json must be installed as a shared library on the database server (an OS-level package install), and creating the replication slot requires a role with the `REPLICATION` attribute — both may need DBA involvement.
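Before writing the Go consumer, the same stream can be smoke-tested with the stock `pg_recvlogical` client (slot, database, and user names are hypothetical; wal2json must already be installed on the server):

```shell
# Create a slot bound to the wal2json plugin, then stream changes to stdout.
pg_recvlogical -d erp -U cdc_user --slot overlay_slot --create-slot -P wal2json
pg_recvlogical -d erp -U cdc_user --slot overlay_slot --start -o pretty-print=0 -f -
```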
Privileges: The user needs the `REPLICATION` attribute (or superuser) to create a replication slot and start logical decoding. Logical decoding emits all changes regardless of table `SELECT` privileges; with `pgoutput` you additionally define publications (which tables to include), and creating those requires table ownership or appropriate rights.
Performance: Logical decoding is efficient; built into Postgres. Overhead similar to normal replication.
Recommendation: For a later phase, use wal2json with a custom consumer (or Debezium if already using Kafka). For MVP, we can postpone.
---
Oracle (Optional, Expensive)
Oracle provides GoldenGate — a mature, high-performance CDC product, but it is commercial and expensive. It captures changes from redo logs and streams to many targets.
There is also the Debezium Oracle connector, which reads redo either via Oracle LogMiner (the default; no GoldenGate license needed) or via the XStream API (part of GoldenGate licensing, requiring Oracle client libraries that may be restricted). The Oracle connector is not as widely used as Debezium's others, and Oracle licensing is tricky.
Reality: Most mid-size enterprises in LATAM using Oracle likely have GoldenGate already for replication. Partnering with Oracle CDC would mean integrating with GoldenGate’s APIs or consuming its output (e.g., to Kafka). That adds cost and complexity.
MVP stance: Mark Oracle as out of scope for initial MVP. Document that if needed later, either:
- Use Debezium Oracle connector with XStream (requires Oracle client license).
- Or build a custom extractor using Oracle’s LogMiner (complex, performance-sensitive).
---
Comparative Summary & MVP Recommendation
| DBMS | Recommended MVP Approach | Requires Kafka? | Key Privileges | Maturity | On‑prem Fit |
|-------------|----------------------------------------------|-----------------|--------------------------------------|----------|-------------|
| MySQL       | Debezium Server (HTTP sink) or Maxwell (custom producer) | No (direct) | REPLICATION SLAVE/CLIENT, SELECT on tables | High | Excellent |
| SQL Server | Native CDC + custom Go poller | No | db_owner (to enable CDC), SELECT on change tables | High | Very Good |
| PostgreSQL | wal2json + Go replication consumer | No | REPLICATION role (or superuser) | Medium | Good |
| Oracle | Defer; later use GoldenGate or Debezium XStream | Maybe (Debezium) | GoldenGate privileges / XStream | Medium | Poor (cost) |
Why this mix for MVP? We want minimal dependencies, and Kafka is a heavy dependency for a 60‑day pilot. For MySQL we therefore choose a direct sink — Debezium Server's HTTP sink, or Maxwell with a custom producer — that pushes events to our overlay service over a simple REST endpoint. That keeps the stack to: MySQL → Debezium/Maxwell (Java process) → our Go service (HTTP server). For SQL Server, we poll the CDC tables; no Kafka needed.
If later we decide Kafka is acceptable, we could standardize on Debezium across all databases and use Kafka Connect as the ingestion bus.
---
Operational Considerations
Required Privileges
- MySQL: `REPLICATION SLAVE` and `REPLICATION CLIENT` (to read the binlog and its position), plus `SELECT` on captured tables (for snapshots). If using a read replica, these are typically granted. The connector behaves like a replica and does not write to the source (except Maxwell's state database, if used; Debezium offsets can be stored externally). In our MVP we will store offsets in our own database, avoiding writes to the source.
- SQL Server: Enabling CDC requires `sysadmin` at the database level (`sys.sp_cdc_enable_db`) and `db_owner` per table — a one-time DBA action. The poller then uses a dedicated login with read-only (`SELECT`) access to the CDC change tables.
- Postgres: User with `REPLICATION` attribute to create a replication slot and start logical replication. Also need to `CREATE` a publication (or be table owner). Could be superuser, but ideally we create a role with `REPLICATION` plus `SELECT` on needed tables.
- Oracle: If using GoldenGate, needs GG admin privileges; out of scope.
Performance Impact
CDC reads the transaction log/replication stream. On a primary, this adds a small read load. Best practice: attach CDC to a read replica if possible. For MySQL, the replica already applies changes, so serving its binlog to a CDC reader is cheap — but the replica must re-log replicated changes (`log_slave_updates=ON`) and use `binlog_format=ROW` with `binlog_row_image=FULL`. We can then point Debezium/Maxwell at the replica's binlog. Performance tuning: adjust batch sizes and network buffers. Monitor replica lag.
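The replica settings this implies can be sketched in `my.cnf` (values are illustrative):

```ini
# Replica settings so a CDC reader can consume the replica's binlog.
[mysqld]
server_id         = 223344   # unique within the replication topology
log_bin           = mysql-bin
binlog_format     = ROW      # required for row-level CDC
binlog_row_image  = FULL     # full before/after images in each event
log_slave_updates = ON       # re-log replicated changes into the replica's own binlog
expire_logs_days  = 7        # retain enough binlog to survive connector downtime
```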
For SQL Server, reading CDC tables adds load on the primary because CDC cleanup job runs there; but reading from a secondary replica is not straightforward (CDC tables are not replicated). So for SQL Server we may need to read from primary. That's acceptable if load is low; CDC tables are indexed and incremental queries are cheap.
Offset Management & Failure Recovery
We need effectively-once processing of events downstream: at-least-once delivery from the connector plus idempotent application. The simplest approach: store the latest offset (LSN, binlog position, or transaction ID) in our overlay's own database after successfully processing a batch. On restart, read the last offset and resume. This is a common pattern.
- MySQL: binlog filename + position (or GTID). Debezium provides these in event headers.
- SQL Server: LSN range.
- Postgres: LSN (Log Sequence Number).
If our overlay crashes after processing but before offset commit, we might reprocess events; we need idempotency in downstream storage (e.g., deduplicate by event ID). We'll design the `CanonicalChangeEvent` with a globally unique ID derived from source transaction ID + sequence.
Backup: Periodic snapshots of the offset store.
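The offset-commit and deduplication pattern above can be sketched in Go. Names such as `eventID`, `OffsetStore`, and the GTID/binlog values are illustrative, and an in-memory store stands in for the overlay's database:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// eventID derives a deterministic, globally unique event ID from the
// source transaction identifier (GTID, LSN, or xid) plus the event's
// sequence number within that transaction. Reprocessing after a crash
// yields the same IDs, so downstream storage can deduplicate.
func eventID(sourceTxn string, seq int) string {
	sum := sha256.Sum256([]byte(fmt.Sprintf("%s:%d", sourceTxn, seq)))
	return hex.EncodeToString(sum[:])
}

// OffsetStore is the minimal contract the overlay needs: persist the
// last fully processed position and read it back on restart.
type OffsetStore interface {
	Save(offset string) error
	Load() (string, bool)
}

// memStore is an in-memory stand-in used here for illustration only.
type memStore struct {
	offset string
	set    bool
}

func (m *memStore) Save(o string) error  { m.offset, m.set = o, true; return nil }
func (m *memStore) Load() (string, bool) { return m.offset, m.set }

func main() {
	store := &memStore{}
	seen := map[string]bool{} // downstream dedup by event ID

	// Two hypothetical events from one source transaction.
	for seq := 0; seq < 2; seq++ {
		id := eventID("gtid-1", seq)
		if !seen[id] {
			seen[id] = true // "apply" the event idempotently
		}
	}

	// Commit the offset only after the whole batch is applied.
	store.Save("binlog.000007:154")
	if off, ok := store.Load(); ok {
		fmt.Println("resume from", off)
	}
}
```

If the process crashes between applying the batch and saving the offset, the batch is re-read on restart, but identical event IDs make the re-application a no-op.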
---
Conclusion
For an on‑prem MVP targeting MySQL first, the recommended approach is Debezium Server with the HTTP sink (or Maxwell with a custom producer) to avoid Kafka while retaining a robust binlog reader. This gives us structured JSON events with before/after images and transaction boundaries. For SQL Server later, a native CDC poller is simple and avoids extra infrastructure. PostgreSQL can use wal2json; Oracle defers.
The next step is to define the event normalization schema and semantic inference layer, which will consume these CDC streams regardless of source. That will be covered in Report 2.
---