MVP Architecture Specification: On-Prem CDC Observability Overlay
Scope: 60‑day MVP for MySQL, with path to SQL Server later. Target: Peru/LATAM mid-size enterprises. On-prem, minimal dependencies, quick value.
---
High-Level Overview
The overlay sits alongside a legacy application’s database, captures change data via CDC, normalizes events, infers business semantics, stores them in an append-only log, and provides AI‑powered queries and summaries via CLI, web UI, and MCP. No application code changes required.
Core pipeline:
[Source DB binlog / CDC] → [Ingestion Service] → [Normalizer] → [Inference Engine] → [Event Store] → [AI Layer] → [Interfaces]
We treat the overlay as a sidecar that observes and enriches, not as an invasive modification.
---
Component Design
A) CDC Ingestion
Chosen: Debezium Server configured for MySQL with the HTTP sink. Debezium reads the binlog, produces `ChangeEvent` protobuf/JSON, and POSTs to our overlay’s `/ingest` endpoint.
Why Debezium Server (vs Kafka Connect)? Eliminates Kafka dependency while keeping Debezium’s robust binlog parsing, schema handling, and transaction grouping. It’s a single Java process (≈200 MB memory) that can run in a Docker container or as a native process.
Alternative: Maxwell daemon in HTTP mode. Simpler but writes offset table to source DB; we’d need to ensure that’s acceptable on a replica. Debezium can store offsets externally (JDBC or file), giving more flexibility.
Inputs:
- MySQL connection string (host, port, user, password). The CDC user needs at least `REPLICATION SLAVE`, `REPLICATION CLIENT`, and `SELECT` on the captured tables (an initial Debezium snapshot typically also requires `RELOAD` and `SHOW DATABASES`).
- Database include/exclude filters for tables to capture.
- HTTP endpoint URL (our overlay service).
- Optional: TLS, keepalive.
Output:
- HTTP POST per batch of events (JSON array). Payload includes:
- `source` (database, table, pk)
- `op` (Debezium operation code: `c` insert, `u` update, `d` delete, `r` snapshot read)
- `before`, `after` (objects of column values; `before` is `null` for inserts, `after` is `null` for deletes)
- `ts_ms` (source timestamp), `transaction_id` (for grouping)
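A minimal sketch of the structs the `/ingest` handler could decode this payload into. It assumes the HTTP sink is configured for JSON output with schemas disabled, so each array element is the Debezium payload object; the struct and field names below are illustrative, not final.
```go
// Sketch of the inbound Debezium payload, assuming JSON output with schemas
// disabled (each array element is the event payload object).
package ingest

import "encoding/json"

// debeziumSource carries the origin of a change; Debezium also includes
// binlog position, GTID, etc., which the MVP ignores.
type debeziumSource struct {
	DB    string `json:"db"`
	Table string `json:"table"`
}

// debeziumEvent is the subset of fields the overlay consumes.
type debeziumEvent struct {
	Op          string                     `json:"op"`     // "c", "u", "d", "r"
	TsMs        int64                      `json:"ts_ms"`  // source timestamp, ms since epoch
	Source      debeziumSource             `json:"source"`
	Before      map[string]json.RawMessage `json:"before"` // null for inserts
	After       map[string]json.RawMessage `json:"after"`  // null for deletes
	Transaction *struct {
		ID string `json:"id"`
	} `json:"transaction,omitempty"`
}
```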
---
B) Event Normalization
Normalization converts Debezium’s MySQL‑specific event format into a CanonicalChangeEvent that shields the rest of the system from source‑specific details.
Raw Debezium Event → Normalizer → CanonicalChangeEvent
CanonicalChangeEvent fields:
- `event_id`: UUID v5 derived from `source` + `transaction_id` + sequence; deterministic, so a redelivered event maps to the same ID and can be deduplicated on insert.
- `source`: object with `database`, `table`, `primary_key` (map of pk columns → values).
- `operation`: `"INSERT" | "UPDATE" | "DELETE"` (normalized strings).
- `before`: map[string]interface{} or `null`.
- `after`: map[string]interface{} or `null`.
- `timestamp`: RFC3339 UTC timestamp converted from the source's `ts_ms` (millisecond precision).
- `transaction_id`: string (from Debezium, e.g., GTID or binlog filename+pos). Used for grouping.
- `schema_version`: optional string to track table schema changes.
Normalization also performs light validation: ensure required fields present; reject malformed events (with alert). If schema mapping for a table is unknown, we store raw payload in a `metadata.raw` field and continue; later we can add mapping definitions.
The normalizer is a pure function, which makes it easy to unit test; a minimal sketch follows.
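The sketch below assumes the `google/uuid` package for deterministic IDs; the namespace UUID, parameter list, and treatment of snapshot reads are MVP assumptions, not final design.
```go
// Sketch of the canonical event and a pure normalization step.
package normalize

import (
	"fmt"
	"time"

	"github.com/google/uuid"
)

type CanonicalChangeEvent struct {
	EventID       string         `json:"event_id"`
	Database      string         `json:"database"`
	Table         string         `json:"table"`
	PrimaryKey    map[string]any `json:"primary_key"`
	Operation     string         `json:"operation"` // INSERT | UPDATE | DELETE
	Before        map[string]any `json:"before,omitempty"`
	After         map[string]any `json:"after,omitempty"`
	Timestamp     time.Time      `json:"timestamp"` // UTC, from ts_ms
	TransactionID string         `json:"transaction_id,omitempty"`
	SchemaVersion string         `json:"schema_version,omitempty"`
}

// Debezium op codes mapped to normalized names; snapshot reads ("r") are
// treated as inserts here, which is an MVP simplification.
var opNames = map[string]string{"c": "INSERT", "u": "UPDATE", "d": "DELETE", "r": "INSERT"}

// eventIDNamespace is an arbitrary fixed namespace for UUIDv5 derivation.
var eventIDNamespace = uuid.MustParse("b6f1a6c0-0000-4000-8000-000000000000")

// Normalize is a pure function: same input, same output, no side effects.
func Normalize(db, table, op, txID string, seq int, tsMs int64,
	before, after map[string]any) (CanonicalChangeEvent, error) {

	normalized, ok := opNames[op]
	if !ok {
		return CanonicalChangeEvent{}, fmt.Errorf("unknown op %q", op)
	}
	// Deterministic UUIDv5 (google/uuid's NewSHA1 returns a version-5 UUID),
	// so a redelivered event yields the same event_id and dedupes on insert.
	name := fmt.Sprintf("%s.%s/%s/%d", db, table, txID, seq)
	return CanonicalChangeEvent{
		EventID:       uuid.NewSHA1(eventIDNamespace, []byte(name)).String(),
		Database:      db,
		Table:         table,
		Operation:     normalized,
		Before:        before,
		After:         after,
		Timestamp:     time.UnixMilli(tsMs).UTC(),
		TransactionID: txID,
	}, nil
}
```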
---
C) Semantic Event Inference
Raw change events are low-level. The inference layer maps sequences of changes to BusinessEvents that domain users care about: `InventoryAdjustment`, `Sale`, `UserRoleChange`, `BranchOpening`, etc.
For MVP, we implement a rule engine in Go:
- Rules are defined per domain (pharmacy for the pilot). Each rule matches a pattern of canonical events within a time window.
- Example rule: `InventoryAdjustment` — a change where table = `inventory`, operation = `UPDATE`, and the changed columns include `quantity` (optionally with a non-null `adjustment_reason`). If a corresponding `sales` row exists for the same product within a few seconds, the event is reclassified as `Sale` (see the sketch below).
- Rules produce a `BusinessEvent`:
- `type`: string (e.g., `inventory.adjustment`)
- `entity_type`, `entity_id` (e.g., `product`, `123`)
- `timestamp`
- `payload`: enriched fields (previous_quantity, new_quantity, reason, linked_sale_id if any)
- `confidence`: high (rule-based)
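A minimal sketch of how one rule could be expressed as a plain Go function over the `CanonicalChangeEvent` type from section B. The column names follow the demo pharmacy schema; the `Rule` signature and confidence value are assumptions.
```go
// Sketch of a single rule from the engine. CanonicalChangeEvent is the type
// defined in the normalizer sketch (section B).
package inference

import (
	"fmt"
	"time"
)

type BusinessEvent struct {
	Type       string         `json:"type"`        // e.g. "inventory.adjustment"
	EntityType string         `json:"entity_type"` // e.g. "product"
	EntityID   string         `json:"entity_id"`
	Timestamp  time.Time      `json:"timestamp"`
	Payload    map[string]any `json:"payload"`
	Confidence float64        `json:"confidence"` // rule-based matches are high confidence
}

// Rule inspects the current event plus a recent window and may emit a BusinessEvent.
type Rule func(current CanonicalChangeEvent, window []CanonicalChangeEvent) (BusinessEvent, bool)

// InventoryAdjustmentRule fires on UPDATEs that change inventory.quantity.
func InventoryAdjustmentRule(cur CanonicalChangeEvent, _ []CanonicalChangeEvent) (BusinessEvent, bool) {
	if cur.Table != "inventory" || cur.Operation != "UPDATE" {
		return BusinessEvent{}, false
	}
	prev, next := cur.Before["quantity"], cur.After["quantity"]
	if fmt.Sprint(prev) == fmt.Sprint(next) {
		return BusinessEvent{}, false // quantity unchanged, not an adjustment
	}
	return BusinessEvent{
		Type:       "inventory.adjustment",
		EntityType: "product",
		EntityID:   fmt.Sprint(cur.After["product_id"]),
		Timestamp:  cur.Timestamp,
		Payload: map[string]any{
			"previous_quantity": prev,
			"new_quantity":      next,
			"reason":            cur.After["adjustment_reason"],
		},
		Confidence: 0.9,
	}, true
}
```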
This layer is deliberately simple for MVP; later we could incorporate ML or LLM to detect anomalies or infer hidden relationships.
---
D) Storage Layer
We need an append‑only event log with efficient queries by time, entity, and event type.
Choice: SQLite for MVP.
Why SQLite?
- Serverless, single file, zero ops.
- Sufficient performance for expected pilot volume (a few hundred thousand events per month).
- ACID, journaling ensures durability.
- Indexes: `CREATE INDEX idx_timestamp ON events(timestamp);`, `CREATE INDEX idx_entity ON events(entity_type, entity_id);`, `CREATE INDEX idx_type ON events(type);`
- We can later migrate to Postgres by using a repository interface (e.g., `Storage` interface with `Append`, `Query`, `Get`).
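A sketch of that repository interface, reusing the `CanonicalChangeEvent` type from section B; the `Query` filter fields are assumptions and will grow with the API.
```go
// Sketch of the storage abstraction: a second implementation (e.g. Postgres)
// can be added later without touching callers.
package store

import (
	"context"
	"time"
)

type Query struct {
	From, To          time.Time
	EntityType        string
	EntityID          string
	BusinessEventType string
	Limit, Offset     int
}

type Storage interface {
	// Append writes events in order; it should be idempotent on event_id so
	// that redelivered CDC batches do not create duplicates.
	Append(ctx context.Context, events []CanonicalChangeEvent) error
	// Query returns events matching the filter, newest first.
	Query(ctx context.Context, q Query) ([]CanonicalChangeEvent, error)
	// Get fetches a single event by its id.
	Get(ctx context.Context, eventID string) (CanonicalChangeEvent, error)
}
```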
Schema:
```sql
CREATE TABLE events (
  event_id              TEXT PRIMARY KEY,
  source_db             TEXT NOT NULL,
  source_table          TEXT NOT NULL,
  source_pk_json        TEXT NOT NULL,   -- JSON of PK columns -> values
  operation             TEXT NOT NULL,   -- INSERT | UPDATE | DELETE
  before_json           TEXT,
  after_json            TEXT,
  timestamp             TEXT NOT NULL,   -- RFC3339 UTC (SQLite has no dedicated timestamp type)
  transaction_id        TEXT,
  entity_type           TEXT,
  entity_id              TEXT,
  business_event_type   TEXT,
  business_payload_json TEXT,
  confidence            REAL,
  ingested_at           TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
```
Indexes as above.
---
E) AI Layer
Two main AI‑powered features:
1. Daily Audit Summary: At a configurable time (e.g., 23:59), summarize that day’s business events in natural language. We’ll feed the day’s events (summarized as short bullet JSON) to an LLM (on‑prem Ollama or cloud) with a prompt: "You are an audit assistant. Summarize today’s business events in 3–4 paragraphs, highlighting anomalies, high‑volume adjustments, and system errors. Use clear language." The resulting summary is stored and also available via CLI/MCP.
2. Anomaly Detection & Explanation: Simple heuristic baselines:
- Inventory adjustments > 3σ from historical average.
- After‑hours activity (timestamp outside 08:00–18:00).
- Single user performing many changes in short time.
- Sudden stockout (inventory dropping to zero).
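As a concrete example, the 3σ check reduces to a few lines of Go; the minimum baseline size and the threshold are assumptions to be tuned against pilot data.
```go
// Minimal sketch of the 3σ heuristic for inventory adjustments.
package anomaly

import "math"

// IsOutlier reports whether the latest adjustment magnitude deviates more than
// three standard deviations from the historical mean.
func IsOutlier(history []float64, latest float64) bool {
	if len(history) < 10 {
		return false // not enough baseline data to judge
	}
	var sum float64
	for _, v := range history {
		sum += v
	}
	mean := sum / float64(len(history))

	var variance float64
	for _, v := range history {
		variance += (v - mean) * (v - mean)
	}
	stddev := math.Sqrt(variance / float64(len(history)))

	return math.Abs(latest-mean) > 3*stddev
}
```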
Semantic Search (stretch): For MVP we skip full vector search. Instead, we provide a CLI query: `search "broken"` that does a case‑insensitive `LIKE` across `before_json` and `after_json` (or parsed text fields). If we later add embeddings, we could use `sqlite-vss`.
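A sketch of the query behind that command, using a parameterized `LIKE` over the stored JSON columns (SQLite's `LIKE` is case-insensitive for ASCII by default); the helper name and selected columns are illustrative.
```go
// Sketch of the search backing `auditctl search` — no vector search in MVP.
package search

import (
	"context"
	"database/sql"
)

func SearchEvents(ctx context.Context, db *sql.DB, term string, limit int) (*sql.Rows, error) {
	// The term is passed as a bind parameter to avoid SQL injection.
	const q = `
		SELECT event_id, source_table, operation, timestamp, after_json
		FROM events
		WHERE before_json LIKE '%' || ? || '%'
		   OR after_json  LIKE '%' || ? || '%'
		ORDER BY timestamp DESC
		LIMIT ?`
	return db.QueryContext(ctx, q, term, term, limit)
}
```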
LLM Integration:
- Abstraction: a `Summarizer` interface with `Summarize(ctx, payload) (string, error)` (see the sketch after this list).
- Implementations:
- `OllamaSummarizer`: calls local Ollama API (e.g., `http://localhost:11434/api/generate`) with model `llama3.2:3b` or similar.
- `OpenRouterSummarizer`: calls OpenRouter API with Step-3.5‑Flash (or similar).
- Default: On‑prem Ollama, configurable to cloud.
- Data minimization: Before sending to LLM, strip any PII columns (configurable mask list). For demo we might not have real PII, but we design defensively.
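A sketch of the `Summarizer` abstraction and an Ollama-backed implementation. The request/response fields follow Ollama's `/api/generate` endpoint; the prompt wrapping, model, and base URL are configuration, not fixed here.
```go
// Sketch of the LLM abstraction with an on-prem Ollama implementation.
package ai

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
)

type Summarizer interface {
	Summarize(ctx context.Context, payload string) (string, error)
}

type OllamaSummarizer struct {
	BaseURL string // e.g. "http://localhost:11434"
	Model   string // e.g. "llama3.2:3b"
	Client  *http.Client
}

func (o *OllamaSummarizer) Summarize(ctx context.Context, payload string) (string, error) {
	body, err := json.Marshal(map[string]any{
		"model":  o.Model,
		"prompt": "You are an audit assistant. Summarize today's business events:\n" + payload,
		"stream": false, // single JSON response instead of a token stream
	})
	if err != nil {
		return "", err
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, o.BaseURL+"/api/generate", bytes.NewReader(body))
	if err != nil {
		return "", err
	}
	req.Header.Set("Content-Type", "application/json")

	client := o.Client
	if client == nil {
		client = http.DefaultClient
	}
	resp, err := client.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("ollama returned %s", resp.Status)
	}

	var out struct {
		Response string `json:"response"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", err
	}
	return out.Response, nil
}
```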
---
F) Interfaces
1. CLI (`auditctl`):
- `auditctl summary --date 2025-02-16`
- `auditctl anomalies --start 2025-02-15 --end 2025-02-16`
- `auditctl trace product 12345`
- `auditctl export --format json --out file.json`
- `auditctl health`
2. HTTP API (internal):
- `POST /ingest` — receives CDC batch (JSON), normalizes, stores.
- `GET /summary?date=YYYY-MM-DD` — returns text summary (cached).
- `GET /anomalies?start=...&end=...` — JSON list.
- `GET /events?entity_type=&entity_id=` — paginated events.
- `POST /llm/explain` — (admin) asks LLM to explain an event.
3. Optional Web UI: A single-page React app served from the Go binary using `embed.FS`. Shows recent events, summary, anomalies. Nice‑to‑have for demo; may be cut if time runs short.
4. MCP Server (Model Context Protocol):
- Implement as a separate mode of the binary: `audit-overlay mcp`.
- MCP tools:
- `search_events(query string) → list<Event>`
- `get_daily_summary(date string) → string`
- `get_anomalies(start, end string, severity?) → list<Anomaly>`
- `trace_entity(entity_type, id string) → list<Event>`
- `export_report(format, params) → file or text`
- MCP stdio protocol: read JSON‑RPC requests from stdin, write to stdout.
- Tools must respect data minimization: never return PII fields, and truncate large payloads where appropriate.
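A heavily reduced sketch of the stdio transport: newline-delimited JSON-RPC 2.0 requests on stdin, responses on stdout. The real MCP handshake (`initialize`, `tools/list`, `tools/call`) and schema negotiation are omitted; only the dispatch skeleton is shown, and the handler signature is an assumption.
```go
// Sketch of the stdio loop used by `audit-overlay mcp`.
package mcp

import (
	"bufio"
	"encoding/json"
	"os"
)

type rpcRequest struct {
	JSONRPC string          `json:"jsonrpc"`
	ID      json.RawMessage `json:"id"`
	Method  string          `json:"method"`
	Params  json.RawMessage `json:"params"`
}

type rpcResponse struct {
	JSONRPC string          `json:"jsonrpc"`
	ID      json.RawMessage `json:"id"`
	Result  any             `json:"result,omitempty"`
	Error   any             `json:"error,omitempty"`
}

// Serve reads requests line by line and routes them to tool handlers.
func Serve(handle func(method string, params json.RawMessage) (any, error)) error {
	in := bufio.NewScanner(os.Stdin)
	out := json.NewEncoder(os.Stdout)
	for in.Scan() {
		var req rpcRequest
		if err := json.Unmarshal(in.Bytes(), &req); err != nil {
			continue // skip malformed lines in this sketch
		}
		result, err := handle(req.Method, req.Params)
		resp := rpcResponse{JSONRPC: "2.0", ID: req.ID}
		if err != nil {
			resp.Error = map[string]any{"code": -32000, "message": err.Error()}
		} else {
			resp.Result = result
		}
		if encErr := out.Encode(resp); encErr != nil {
			return encErr
		}
	}
	return in.Err()
}
```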
---
Deployment Topology (Text Diagram)
All components run on the customer’s on‑prem network, ideally on a dedicated host or VM.
+---------------------+ +-------------------------+
| MySQL (primary) | or | MySQL (read replica) |
| binlog enabled | | binlog enabled |
+----------+----------+ +------------+------------+
| |
| (mysql CDC via Debezium) |
v v
+---------------------+ +-------------------------+
| Debezium Server | | Debezium Server |
| (Java container) | | (Java container) |
+----------+----------+ +------------+------------+
| |
| HTTP POST events | HTTP POST events
v v
+----------------------------------------------------------------+
| Audit Overlay Service (Go) |
| - /ingest endpoint |
| - Normalizer |
| - Inference Engine |
| - SQLite storage (append-only) |
| - AI layer (Ollama or cloud LLM) |
| - HTTP API (internal) |
| - CLI binary (auditctl) |
| - MCP server mode |
+----------------------+---------------------------------------+
|
| Docker network (bridge) or localhost
v
+-------------------+
| Docker Compose |
| (or systemd) |
+-------------------+
Optional:
- Web UI (served by Go service)
- Reverse proxy (nginx) for HTTPS access to API/UI
Installation: Docker Compose bundling:
- `debezium-server` (image: `debezium/server:2.5`)
- `audit-overlay` (our custom Go binary, built as Docker image or host binary)
- `mysql` placeholder if customer doesn’t provide their own; for demo we ship a sample MySQL with synthetic data.
Single command: `docker compose up -d`. Everything starts: Debezium begins streaming, the overlay ingests, and the SQLite database is initialized. Within a few minutes, events are flowing.
---
Security & Compliance (Brief here, expanded in Report 5)
- Secrets via Docker secrets or `.env` file (restricted permissions).
- DB user: read‑only replica account (MySQL: `REPLICATION SLAVE`, `SELECT`).
- Overlay service runs as non‑root user.
- Network: container‑to‑container only; no external exposure unless explicitly configured.
- LLM data: strip PII before sending; default to on‑prem Ollama to keep data inside network.
- Audit trail: the event store itself is tamper‑evident (append‑only). We also log internal actions to a separate file.
---
Build vs Buy (Deliverable 8)
Decision: Buy (use OSS) rather than build a custom CDC engine.
Why:
- Writing a correct, production‑grade binlog parser is a months‑long effort. Debezium has years of battle testing.
- Maxwell is also OSS but Debezium’s server mode gives us clean HTTP output without writing offsets to source.
- Custom Go binlog library exists (`github.com/siddontang/go-mysql`), but we’d still need to handle transaction grouping, schema changes, error recovery — effectively re‑implementing Debezium.
- Time‑to‑market is critical: 60‑day MVP.
Staged path:
1. MVP: Debezium Server HTTP → overlay.
2. If customers complain about Java footprint, evaluate Maxwell HTTP or a lightweight Go binlog client (e.g., `godriver` with snapshotting). That would be a future phase (post‑MVP).
Time/Risk:
- Debezium integration: ~1 week to configure, test, and handle edge cases (DDL, reconnect).
- Custom Go CDC: 4–6 weeks for a reliable implementation; high risk of missing corner cases (GTID, partial updates). Not advisable.
---
Implementation Plan (High-Level)
See Report 6 for detailed backlog. Here’s the milestone view:
- Milestone 1 (Days 1–10): Ingestion pipeline end‑to‑end. MySQL → Debezium → HTTP → normalizer → SQLite. Manual verification.
- Milestone 2 (Days 11–20): Inference rules and business event generation. Demo on synthetic pharmacy schema.
- Milestone 3 (Days 21–30): AI layer: daily summary generation using Ollama; simple anomaly heuristics; caching.
- Milestone 4 (Days 31–40): CLI polished; MCP server mode implemented; basic web UI for demo.
- Milestone 5 (Days 41–50): Security hardening; data masking; installation experience (Docker Compose, scripts).
- Milestone 6 (Days 51–60): End‑to‑end testing with representative data; documentation; pilot handoff materials.
---
Conclusion
The proposed MVP architecture is straightforward, on‑prem friendly, and leverages proven OSS components (Debezium, SQLite, Ollama). It delivers tangible auditing capabilities within 60 days while leaving room to grow into SQL Server and other databases. The next report will define the event model and the demo pharmacy domain.