Research Decisions
Log of explicit decisions made during the research, with rationale.
Decision 1: MySQL CDC MVP Choice — Debezium Server (HTTP sink) vs Maxwell HTTP
- Context: Need a MySQL CDC solution for an on-prem MVP that is easy to install, has low overhead, and avoids Kafka to reduce complexity.
- Alternatives considered:
  - Debezium with Kafka Connect: mature and feature-rich, but requires a Kafka cluster (brokers, topics, and a ZooKeeper or KRaft quorum) plus Connect workers. Significant ops overhead for a 60‑day pilot.
  - Maxwell daemon in HTTP mode: simpler; a single Java process that pushes JSON directly to an HTTP endpoint. Stores its offsets in a table in the source DB (i.e., writes to the source).
  - Debezium Server (HTTP sink): runs Debezium embedded and emits events directly over HTTP, no Kafka. Offsets can be stored externally.
  - Custom binlog parser in Go: risky; re‑invents the wheel and is prone to edge‑case bugs.
- Rationale: Choose Debezium Server (HTTP sink) as primary recommendation because it retains Debezium's robust handling of transactions, schema changes, and event formatting while eliminating Kafka. If Debezium Server proves too heavy to configure, Maxwell HTTP is a credible fallback, provided we use a read replica and accept the small offset table on the replica.
- Risks:
  - Debezium Server is less commonly deployed than Kafka Connect; documentation may be sparser.
  - Still requires a Java runtime; larger footprint than a native Go tool.
  - Need to verify that Debezium Server's backpressure handling is acceptable (HTTP 500 responses and retries).
  - Maxwell's offset table writes to the source (if pointed at the primary) could be objectionable; mitigate by using a replica or switching to Debezium.
- Date: 2026-02-16
Decision 2: SQL Server CDC Approach — Native CDC + Custom Go Poller
- Context: Must capture changes from SQL Server for later phases; we want to avoid Kafka if possible for simplicity.
- Alternatives:
  - Debezium SQL Server connector (reads the CDC tables, outputs to Kafka): consistent with MySQL if we had chosen Kafka, but we want to avoid Kafka.
  - Native SQL Server CDC with a custom poller: enable CDC on the needed tables, then periodically call `cdc.fn_cdc_get_all_changes_<capture_instance>` to fetch changes. Store the last-processed LSN in our own DB.
  - Change Tracking: lighter-weight, but lacks before-images of changed rows; insufficient for audit.
- Rationale: Go with native CDC + custom Go poller. It avoids Kafka, can be packaged as part of our overlay service, and the polling interval can be tuned (e.g., every 10 seconds). Overhead is minimal on modern hardware. The poller becomes a first-class ingestion component.
- Risks:
  - Polling adds latency of up to the poll interval; changes are only lost if the poller falls behind the CDC cleanup retention window.
  - Need to manage LSN persistence; if the poller crashes we might skip or re-read changes, so ingestion must be idempotent.
  - Enabling CDC on many tables consumes DB space for the change tables; needs a cleanup job.
  - Enabling CDC requires elevated privileges (`db_owner` for tables, `sysadmin` to enable it on the database); may need DBA assistance initially.
- Date: 2026-02-16
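Each poll tick reduces to one templated T‑SQL call plus an LSN save. A sketch of the query builder; the capture instance name `dbo_sales` is hypothetical (real names come from SQL Server once CDC is enabled), and `@from_lsn`/`@to_lsn` would be bound from our stored LSN and `sys.fn_cdc_get_max_lsn()`:

```go
package main

import "fmt"

// buildCDCQuery builds the T-SQL run on each poll tick. The capture
// instance name is per-table and assigned when CDC is enabled; "dbo_sales"
// below is illustrative only. N'all' requests every change row, not just
// the net effect, which is what an audit trail needs.
func buildCDCQuery(captureInstance string) string {
	return fmt.Sprintf(
		"SELECT * FROM cdc.fn_cdc_get_all_changes_%s(@from_lsn, @to_lsn, N'all');",
		captureInstance)
}

func main() {
	// In the real poller: load last LSN, run this query via database/sql,
	// store the rows durably, and only then persist the new LSN
	// (crash between store and persist => re-read, hence idempotency).
	fmt.Println(buildCDCQuery("dbo_sales"))
}
```

Persisting the LSN strictly after the fetched rows are stored turns the crash failure mode into at-least-once delivery, which the idempotent ingestion noted under Risks then absorbs.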
Decision 3: Demo Domain — Pharmacy Chain (Preferred)
- Context: Need a concrete domain for synthetic data generation and demo scenarios.
- Alternatives:
  - Pharmacy chain: typical tables (branches, products, inventory, sales, adjustments, suppliers, users, roles). Rich audit signals (inventory adjustments, sales, returns, supplier changes).
  - Transportation operator: fleet, routes, drivers, maintenance, fuel, incidents. Also interesting, but less directly audit‑centric for mid‑size enterprises; pharmacy feels more universal.
- Rationale: Choose Pharmacy chain. It’s a common mid‑size business in LATAM (drugstores, perfumerías). The schema is intuitive: inventory management is core, and anomalies (theft, fraud, stockouts) are easy to understand. Sales and adjustments provide clear audit trails.
- Risks:
  - Must ensure synthetic data reflects realistic patterns (daily sales cycles, supplier deliveries, returns).
  - Need to design a useful set of anomaly patterns (e.g., employee discount abuse, bogus adjustments, sudden inventory depletion).
- Date: 2026-02-16
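For the synthetic data to exercise the audit features, each anomaly pattern needs an injectable, detectable shape. A toy sketch of the "sudden inventory depletion" pattern; the struct fields and threshold are illustrative, not a schema commitment:

```go
package main

import "fmt"

// InventoryAdjustment is an assumed shape for the pharmacy demo schema;
// field names are illustrative, not a fixed design.
type InventoryAdjustment struct {
	Branch  string
	Product string
	Delta   int // negative = stock removed
	Reason  string
}

// suspiciousDepletion flags adjustments removing more than threshold units
// in a single entry -- a toy version of the "sudden inventory depletion"
// pattern the synthetic data generator must be able to inject.
func suspiciousDepletion(adjs []InventoryAdjustment, threshold int) []InventoryAdjustment {
	var out []InventoryAdjustment
	for _, a := range adjs {
		if a.Delta < -threshold {
			out = append(out, a)
		}
	}
	return out
}

func main() {
	adjs := []InventoryAdjustment{
		{"B01", "ibuprofen-400", -3, "damaged"},
		{"B01", "ibuprofen-400", -120, "correction"}, // injected anomaly
	}
	for _, a := range suspiciousDepletion(adjs, 50) {
		fmt.Printf("suspicious: %s %s %d (%s)\n", a.Branch, a.Product, a.Delta, a.Reason)
	}
}
```

Real patterns would be statistical (deviation from a branch's baseline) rather than a fixed threshold; the point is that generator and detector must agree on each pattern's signature.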
Decision 4: Storage Layer — SQLite for MVP, Postgres Upgrade Path
- Context: Need an append‑only event store with indexes. Must be on‑prem friendly and zero‑ops.
- Alternatives:
  - SQLite: serverless, single file, simple.
  - Postgres: full server, more features, better concurrency, larger scale.
  - Embedded KV stores (Badger, Bolt): less query flexibility.
- Rationale: Choose SQLite for MVP. It meets the pilot's performance needs (a single writer, event tables on the order of a million rows), requires no separate server process, and matches the “easy install” requirement. We design a repository interface so a future switch to Postgres means swapping the implementation only.
- Risks:
  - SQLite write concurrency is limited to one writer. Our overlay is the only writer, so acceptable.
  - Size limit (140 TB) far above pilot needs.
- Date: 2026-02-16
Decision 5: MCP Transport — Stdio Only
- Context: Expose audit tools to external AI agents (ChatGPT, Claude) securely.
- Alternatives:
  - Stdio (subprocess, pipes): no network exposure; simple auth (Unix user).
  - TCP socket with TLS and token auth: more flexible, but requires port management and certificates.
- Rationale: Stdio is simplest and safest for MVP. It assumes a trusted agent on the same host (or launched by the user). No firewall changes needed. We can add TCP later if customers demand remote agent access.
- Risks:
  - Only one client at a time; acceptable for personal use.
  - Requires the agent to spawn the process rather than connect to a long‑running daemon; MCP clients typically handle this.
- Date: 2026-02-16
Decision 6: AI Layer — On‑Prem Ollama Default, Cloud Optional
- Context: Summarization and anomaly explanations require an LLM. Need to balance quality, compliance, and operational simplicity.
- Alternatives:
  - On‑prem Ollama (e.g., `llama3.2:3b` or `mistral:7b`): no data leaves the network, but output quality may be lower.
  - Cloud LLM (OpenRouter, Anthropic): higher quality, but requires outbound internet access and a data processing agreement.
- Rationale: Default to on‑prem Ollama for MVP, respecting data‑minimization and compliance. Provide configuration switch to cloud LLM if customer explicitly wants better quality and accepts data transfer.
- Risks:
  - The on‑prem model may produce vague summaries, which could affect pilot perception. Mitigation: choose a reasonably capable model (7B class) and iterate on prompt design.
- Date: 2026-02-16
Decision 7: MVP Feature Set — Core Five (Day 60)
- Context: 60‑day timeline is tight; must choose which audit intelligence features to ship.
- Alternatives:
  - Full 8‑feature set.
  - Core subset: daily summary, inventory anomaly, user suspicion, entity traceability, branch variance.
- Rationale: Prioritize features that cover most audit scenarios and demonstrate clear ROI. Defer sales anomaly, PO fraud, and role change alerts to post‑MVP.
- Risks:
  - Some pilots may expect those advanced alerts; manage expectations during the sales process.
- Date: 2026-02-16
Further decisions will be added as the research progresses.