
MVP Architecture Specification: On-Prem CDC Observability Overlay



Scope: 60‑day MVP for MySQL, with path to SQL Server later. Target: Peru/LATAM mid-size enterprises. On-prem, minimal dependencies, quick value.

---

High-Level Overview



The overlay sits alongside a legacy application’s database, captures change data via CDC, normalizes events, infers business semantics, stores them in an append-only log, and provides AI‑powered queries and summaries via CLI, web UI, and MCP. No application code changes required.

Core pipeline:


[Source DB binlog / CDC] → [Ingestion Service] → [Normalizer] → [Inference Engine] → [Event Store] → [AI Layer] → [Interfaces]


We treat the overlay as a sidecar that observes and enriches, not as an invasive modification.

---

Component Design



A) CDC Ingestion



Chosen: Debezium Server configured for MySQL with the HTTP sink. Debezium reads the binlog, serializes each change event as JSON, and POSTs it to our overlay's `/ingest` endpoint.

Why Debezium Server (vs Kafka Connect)? Eliminates Kafka dependency while keeping Debezium’s robust binlog parsing, schema handling, and transaction grouping. It’s a single Java process (≈200 MB memory) that can run in a Docker container or as a native process.

Alternative: Maxwell daemon in HTTP mode. Simpler but writes offset table to source DB; we’d need to ensure that’s acceptable on a replica. Debezium can store offsets externally (JDBC or file), giving more flexibility.

Inputs:


Output:

---


B) Event Normalization



Normalization converts Debezium’s MySQL‑specific event format into a CanonicalChangeEvent that shields the rest of the system from source‑specific details.


Raw Debezium Event → Normalizer → CanonicalChangeEvent


CanonicalChangeEvent fields:


Normalization also performs light validation: ensure required fields are present and reject malformed events (raising an alert). If the schema mapping for a table is unknown, we store the raw payload in a `metadata.raw` field and continue; mapping definitions can be added later.


The normalizer is a pure function; easy to test.

---

C) Semantic Event Inference



Raw change events are low-level. The inference layer maps sequences of changes to BusinessEvents that domain users care about: `InventoryAdjustment`, `Sale`, `UserRoleChange`, `BranchOpening`, etc.

For MVP, we implement a rule engine in Go:


This layer is deliberately simple for MVP; later we could incorporate ML or LLM to detect anomalies or infer hidden relationships.


---

D) Storage Layer



We need an append‑only event log with efficient queries by time, entity, and event type.

Choice: SQLite for MVP.

Why SQLite?


Schema:


```sql
CREATE TABLE events (
  event_id              TEXT PRIMARY KEY,
  source_db             TEXT NOT NULL,
  source_table          TEXT NOT NULL,
  source_pk_json        TEXT NOT NULL,  -- JSON of the primary key
  operation             TEXT NOT NULL,
  before_json           TEXT,
  after_json            TEXT,
  timestamp             TEXT NOT NULL,  -- ISO 8601, UTC
  transaction_id        TEXT,
  entity_type           TEXT,
  entity_id             TEXT,
  business_event_type   TEXT,
  business_payload_json TEXT,
  confidence            REAL,
  ingested_at           TEXT DEFAULT (datetime('now'))
);
```


Indexes: `events(timestamp)`, `events(entity_type, entity_id)`, and `events(business_event_type)`, matching the time, entity, and event-type query patterns above.

---

E) AI Layer



Two main AI‑powered features:

1. Daily Audit Summary: At a configurable time (e.g., 23:59), summarize that day’s business events in natural language. We’ll feed the day’s events (summarized as short bullet JSON) to an LLM (on‑prem Ollama or cloud) with a prompt: "You are an audit assistant. Summarize today’s business events in 3–4 paragraphs, highlighting anomalies, high‑volume adjustments, and system errors. Use clear language." The resulting summary is stored and also available via CLI/MCP.

2. Anomaly Detection & Explanation: Simple heuristic baselines:

Heuristics flag events with `anomaly_score` and `reason`. For each flagged event, we optionally call the LLM to generate a natural‑language explanation: "Why is this suspicious?" The LLM receives the event details and returns a short rationale.


Semantic Search (stretch): For MVP we skip full vector search. Instead, we provide a CLI query: `search "broken"` that does a case‑insensitive `LIKE` across `before_json` and `after_json` (or parsed text fields). If we later add embeddings, we could use `sqlite-vss`.

LLM Integration:


---


F) Interfaces



1. CLI (`auditctl`):

Implemented in Go as a subpackage; can be built as a separate binary that talks to the overlay’s HTTP API (or reads SQLite directly).


2. HTTP API (internal):

All endpoints are unauthenticated and bound to localhost; if the API is exposed beyond the host, require mTLS or, at minimum, Basic Auth over TLS.


3. Optional Web UI: A single-page React app served from the Go binary using `embed.FS`. Shows recent events, summary, anomalies. Nice‑to‑have for demo; may be cut if time runs short.

4. MCP Server (Model Context Protocol):


---


Deployment Topology (Text Diagram)



All components run on the customer’s on‑prem network, ideally on a dedicated host or VM.


```
+---------------------+        +-------------------------+
|  MySQL (primary)    |   or   |  MySQL (read replica)   |
|  binlog enabled     |        |  binlog enabled         |
+----------+----------+        +------------+------------+
           |                                |
           | MySQL CDC via Debezium         |
           v                                v
+---------------------+        +-------------------------+
|  Debezium Server    |        |  Debezium Server        |
|  (Java container)   |        |  (Java container)       |
+----------+----------+        +------------+------------+
           |                                |
           | HTTP POST events               | HTTP POST events
           v                                v
+----------------------------------------------------------------+
|                   Audit Overlay Service (Go)                   |
|  - /ingest endpoint                                            |
|  - Normalizer                                                  |
|  - Inference Engine                                            |
|  - SQLite storage (append-only)                                |
|  - AI layer (Ollama or cloud LLM)                              |
|  - HTTP API (internal)                                         |
|  - CLI binary (auditctl)                                       |
|  - MCP server mode                                             |
+------------------------------+---------------------------------+
                               |
                               | Docker network (bridge) or localhost
                               v
                     +-------------------+
                     |  Docker Compose   |
                     |  (or systemd)     |
                     +-------------------+
```


Optional:

  • Web UI (served by Go service)

  • Reverse proxy (nginx) for HTTPS access to API/UI



Installation: Docker Compose bundling:


A single command, `docker compose up -d`, brings everything up: Debezium begins streaming, the overlay ingests, and SQLite is initialized. Within a few minutes, events are flowing.
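The compose file for that bundle might be shaped roughly like the sketch below; image tags, volume paths, and configuration locations are placeholders, not tested values:

```yaml
# Sketch only -- image tags, paths, and service names are placeholders.
services:
  debezium:
    image: debezium/server:2.7
    volumes:
      - ./debezium-conf:/debezium/conf   # application.properties: MySQL source + HTTP sink
      - debezium-data:/debezium/data     # file-based offsets survive restarts
    depends_on:
      - overlay
  overlay:
    image: audit-overlay:latest          # the Go service
    ports:
      - "127.0.0.1:8080:8080"            # internal API bound to localhost only
    volumes:
      - overlay-data:/var/lib/audit      # SQLite database lives here
volumes:
  debezium-data:
  overlay-data:
```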


---

Security & Compliance (Brief here, expanded in Report 5)



---


Build vs Buy (Deliverable 8)



Decision: Buy (use OSS) rather than build a custom CDC engine.

Why:


Staged path:
1. MVP: Debezium Server HTTP → overlay.
2. If customers complain about the Java footprint, evaluate Maxwell in HTTP mode or a lightweight Go binlog client (e.g., the `go-mysql` replication library, with snapshotting). That would be a future phase (post‑MVP).


Time/Risk:


---


Implementation Plan (High-Level)



See Report 6 for detailed backlog. Here’s the milestone view:


---


Conclusion



The proposed MVP architecture is straightforward, on‑prem friendly, and leverages proven OSS components (Debezium, SQLite, Ollama). It delivers tangible auditing capabilities within 60 days while leaving room to grow into SQL Server and other databases. The next report will define the event model and the demo pharmacy domain.

---
