Architecture Sketches
High-level diagrams and component breakdowns for the AI Audit Overlay MVP.
Mental Model (Initial)
Legacy enterprise app → MySQL database (read-only replica or CDC-enabled primary)
We attach a CDC connector (e.g., Debezium) to the database's binary log. It streams row-level change events either through a message broker (Kafka) or directly to a consumer (still an open question below). Our service consumes these events, normalizes them into a canonical schema, infers business events (e.g., "inventory adjustment", "sale", "user role change"), stores them in an append-only event log, and provides AI-powered summaries and anomaly detection.
Key idea: We are an overlay — no app code changes. We observe data mutations and reconstruct business processes.
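If we take the Kafka route, attaching Debezium is a single REST call against Kafka Connect. A minimal sketch, assuming Kafka Connect is reachable on localhost:8083 and using Debezium 2.x config key names; the hostnames, credentials, server id, and table filter are placeholders to be set per deployment:

```python
import json
import urllib.request

# Minimal Debezium MySQL connector config (Debezium 2.x key names).
# Hostnames, credentials, server id, and table filters are placeholders.
connector = {
    "name": "audit-overlay-mysql",
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql.internal",   # read replica or CDC-enabled primary
        "database.port": "3306",
        "database.user": "cdc_reader",           # read-only CDC account
        "database.password": "********",
        "database.server.id": "184054",          # must be unique among MySQL replicas
        "topic.prefix": "audit",                 # prefix for emitted Kafka topics
        "table.include.list": "pharmacy.inventory,pharmacy.orders",
        "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
        "schema.history.internal.kafka.topic": "audit.schema-history",
    },
}

# Register the connector via the Kafka Connect REST API.
req = urllib.request.Request(
    "http://localhost:8083/connectors",
    data=json.dumps(connector).encode(),
    headers={"Content-Type": "application/json"},
)
print(urllib.request.urlopen(req).read().decode())
```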
Component Map (Draft)
```
[MySQL binlog]
      ↓  (Debezium MySQL connector)
[CDC events] → Kafka (optional intermediary; could be direct)
      ↓  (ingestion service)
[RawChangeEvent] → Normalizer → [CanonicalChangeEvent]
      ↓  (inference engine)
[BusinessEvent] → Append-only Event Store (Postgres/SQLite)
      ↓
[AI Layer] ← query/index service
      ↓
[CLI + Web UI + MCP]
```
Simpler for MVP: Debezium → direct gRPC/HTTP consumer → normalization → inference → SQLite/Postgres → FastAPI/Go service → CLI/Web/MCP.
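Debezium Server can be configured to POST change events to an HTTP endpoint instead of Kafka, which makes the broker-less variant plausible. A minimal FastAPI receiver sketch: the in-memory queue is a stand-in for the real pipeline, and one-event-per-request delivery is an assumption to verify against the sink's batching settings.

```python
from collections import deque

from fastapi import FastAPI, Request

app = FastAPI()

# In-memory buffer standing in for the real pipeline; the normalizer and
# inference engine (sketched under Data Flow below) would consume from here.
INGEST_QUEUE: deque[dict] = deque()

@app.post("/cdc")
async def receive_change_event(request: Request) -> dict:
    # Debezium Server's HTTP sink POSTs serialized change events to a
    # configured URL; we assume one event per request body here.
    INGEST_QUEUE.append(await request.json())
    return {"status": "ok"}
```

Run with `uvicorn sink:app` (assuming the file is saved as sink.py) and point the sink at `http://<host>:8000/cdc`; an acknowledgment/retry story is still needed before this is pilot-ready.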
Data Flow
1. Capture: CDC connector reads binlog, produces change events (insert/update/delete with before/after images).
2. Normalize: Convert the DB-specific event format to a `CanonicalChangeEvent` (table, operation, pk, old, new, timestamp, transaction id); a dataclass sketch follows this list.
3. Infer: Apply rules/ML to map sequences of changes to higher-level business events (e.g., a "stock transfer" detected from paired inventory adjustments plus a location change); a rule sketch follows this list.
4. Store: Persist events immutably; build secondary indexes for fast query (by entity, time, type). A store sketch follows this list.
5. AI: Precompute daily summaries; detect anomalies via heuristics plus LLM explanations; semantic search over the event corpus. A first-pass anomaly heuristic is sketched after this list.
6. Serve: CLI commands, optional web dashboard, MCP tools for external agents.
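For step 2, a sketch of the canonical schema plus a normalizer for Debezium's MySQL envelope. Field paths assume the default envelope; the transaction block is only populated when `provide.transaction.metadata` is enabled on the connector, and the pk columns default is an assumption (real primary keys come from the source schema):

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Optional

@dataclass(frozen=True)
class CanonicalChangeEvent:
    table: str
    operation: str                    # "insert" | "update" | "delete" | "snapshot"
    pk: dict[str, Any]
    old: Optional[dict[str, Any]]     # row image before the change (None on insert)
    new: Optional[dict[str, Any]]     # row image after the change (None on delete)
    timestamp: datetime
    transaction_id: Optional[str]

# Debezium op codes mapped to canonical operation names.
_OPS = {"c": "insert", "u": "update", "d": "delete", "r": "snapshot"}

def normalize(event: dict, pk_columns: tuple[str, ...] = ("id",)) -> CanonicalChangeEvent:
    payload = event.get("payload", event)  # some sinks strip the schema wrapper
    row = payload["after"] or payload["before"]
    txn = payload.get("transaction") or {}
    return CanonicalChangeEvent(
        table=payload["source"]["table"],
        operation=_OPS[payload["op"]],
        pk={col: row[col] for col in pk_columns},
        old=payload["before"],
        new=payload["after"],
        timestamp=datetime.fromtimestamp(payload["source"]["ts_ms"] / 1000, tz=timezone.utc),
        transaction_id=txn.get("id"),
    )
```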
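Step 3 can start as plain rule functions over events grouped by source transaction, reusing `CanonicalChangeEvent` from the sketch above. The stock-transfer rule below is hypothetical end to end (the `inventory` table and the `quantity` and `location_id` columns are illustrative, not taken from a real schema):

```python
from collections import defaultdict
from typing import Iterable, Iterator

def infer(events: Iterable[CanonicalChangeEvent]) -> Iterator[dict]:
    # Group by source transaction, then let each rule inspect the group.
    by_txn: dict[str, list[CanonicalChangeEvent]] = defaultdict(list)
    for e in events:
        by_txn[e.transaction_id or "no-txn"].append(e)
    for txn_id, group in by_txn.items():
        yield from stock_transfer_rule(txn_id, group)

def stock_transfer_rule(txn_id: str, group: list[CanonicalChangeEvent]) -> Iterator[dict]:
    # Hypothetical rule: two inventory updates in one transaction whose
    # quantity deltas cancel out, at two different locations, look like
    # a stock transfer.
    adjustments = [
        e for e in group
        if e.table == "inventory" and e.operation == "update" and e.old and e.new
    ]
    if len(adjustments) != 2:
        return
    deltas = [e.new["quantity"] - e.old["quantity"] for e in adjustments]
    locations = {e.new["location_id"] for e in adjustments}
    if sum(deltas) == 0 and len(locations) == 2:
        yield {
            "type": "stock_transfer",
            "transaction_id": txn_id,
            "quantity": abs(deltas[0]),
            "locations": sorted(locations),
        }
```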
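For step 4, SQLite is enough to demonstrate the append-only contract and the secondary indexes. The column set below is illustrative and assumes a single-writer ingestion process:

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS business_events (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    type        TEXT NOT NULL,
    entity      TEXT,
    occurred_at TEXT NOT NULL,   -- ISO 8601, UTC
    payload     TEXT NOT NULL    -- full business event as JSON
);
CREATE INDEX IF NOT EXISTS idx_events_type   ON business_events (type, occurred_at);
CREATE INDEX IF NOT EXISTS idx_events_entity ON business_events (entity, occurred_at);
"""

def open_store(path: str = "events.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn

def append(conn: sqlite3.Connection, event: dict) -> None:
    # Insert only: no UPDATE or DELETE is ever issued against this table,
    # which is what makes the log append-only by convention.
    conn.execute(
        "INSERT INTO business_events (type, entity, occurred_at, payload) "
        "VALUES (?, ?, ?, ?)",
        (event["type"], event.get("entity"), event["occurred_at"], json.dumps(event)),
    )
    conn.commit()
```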
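For the heuristics half of step 5, something as crude as a z-score over daily event counts is a workable first pass; LLM explanations would be layered on top of whatever this flags:

```python
from statistics import mean, stdev

def flag_anomalous_days(daily_counts: dict[str, int], threshold: float = 3.0) -> list[str]:
    # Flag days whose event count is more than `threshold` standard
    # deviations from the mean. Deliberately simple; per-event-type
    # baselines would be the obvious refinement.
    counts = list(daily_counts.values())
    if len(counts) < 2:
        return []
    mu, sigma = mean(counts), stdev(counts)
    if sigma == 0:
        return []
    return [day for day, n in daily_counts.items() if abs(n - mu) / sigma > threshold]
```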
Constraints & Assumptions
- On-prem: run inside customer's network; Docker Compose or native binaries; minimal external dependencies.
- Security: read-only DB access; no outbound internet required if using on-prem LLM (or allowlist for cloud LLM).
- Performance: low overhead on DB; CDC connector must not impact production workload significantly.
- Quick value: demo within days on synthetic pharmacy schema.
Open Questions (to move to scrapbook)
- Kafka vs lightweight broker? Or Debezium → direct HTTP sink?
- Which DB for the event store? SQLite (embedded, zero-ops single file) vs Postgres (richer features, but a separate service to run)?
- On-prem LLM options: Ollama? LocalGPT? Or allow cloud LLM with data minimization?
- How to handle schema changes in source DB during pilot?
Keep this updated as architecture evolves.