Add Monitor AI (shadow watchdog) to security model

Invisible AI watchdog for every flow:
- Read-only access to context buffer
- Cannot emit messages to pipeline
- Agents have no way to detect or probe it
- Alerts via control plane (email, UI, auto-stop)
- Runs on cheap models (Mistral/Mixtral)

Watches for: endless loops, goal drift, prompt injection, sandbox escape attempts, token budget exhaustion.

Added to Phase 2 (core safety feature).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 20:54:51 -08:00 · d184d22c60
commit d184d22c60
parent d682c677db
1 changed file with 84 additions and 0 deletions
--- a/docs/bloxserver-architecture.md
+++ b/docs/bloxserver-architecture.md
@@ -595,6 +595,87 @@ Custom code runs in WASM, which provides:
 └─────────────────────────────────────────────────────────────┘
 ```
 ### Monitor AI (Shadow Watchdog)
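 Every flow has an invisible AI watchdog that observes the flow but cannot write to it.
 It intervenes only out of band, by raising alerts or requesting a stop through the control plane; agents have no way to detect, probe, or influence it.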
 ```
 ┌─────────────────────────────────────────────────────────────────┐
 │                     User's Flow                                  │
 │  ┌─────────┐    ┌─────────┐    ┌─────────┐                     │
 │  │ Agent A │───▶│ Agent B │───▶│ Agent C │                     │
 │  └─────────┘    └─────────┘    └─────────┘                     │
 │        │              │              │                          │
 │        └──────────────┼──────────────┘                          │
 │                       ▼                                          │
 │              ┌─────────────────┐                                │
 │              │ Context Buffer  │ (append-only log)              │
 │              └────────┬────────┘                                │
 └───────────────────────│─────────────────────────────────────────┘
                        │ Read-only tap (invisible to agents)
                        ▼
 ┌─────────────────────────────────────────────────────────────────┐
 │                    Monitor AI (Shadow)                           │
 │                                                                  │
 │  Capabilities:                                                   │
 │  ├── READ: All messages in context buffer                       │
 │  ├── ALERT: Email user, UI notification, webhook                │
 │  └── STOP: Trigger flow stop via control plane                  │
 │                                                                  │
 │  Cannot:                                                         │
 │  ├── Emit messages to pipeline                                  │
 │  ├── Modify context buffer                                      │
 │  └── Be detected by agents (no probe surface)                   │
 │                                                                  │
 │  Watches for:                                                    │
 │  ├── Endless loops (repeating message patterns)                 │
 │  ├── Token budget approaching limit                             │
 │  ├── Goal drift (agent pursuing unintended objectives)          │
 │  ├── Prompt injection attempts                                  │
 │  ├── Sandbox escape attempts (probing for filesystem, etc.)    │
 │  └── Anomalous conversation patterns                            │
 │                                                                  │
 │  Runs on: Cheap models (Mistral, Mixtral, Llama)               │
 └─────────────────────────────────────────────────────────────────┘
                        │
                        ▼ Out-of-band alerts (not via message bus)
              ┌─────────────────┐
              │  Control Plane  │
              │  • Email user   │
              │  • UI alert     │
              │  • Auto-stop    │
              │  • Audit log    │
              └─────────────────┘
 ```
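
The read-only tap in the diagram can be illustrated with a minimal Python sketch. All names here (`Message`, `ContextBuffer`, `ReadOnlyTap`, `poll`) are hypothetical and not taken from the codebase; the sketch only shows the shape of an interface that exposes reads and nothing else. In a real deployment the isolation would presumably come from process and credential boundaries rather than from a wrapper class.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Message:
    """Hypothetical context-buffer entry; field names are illustrative."""
    sender: str
    body: str


class ContextBuffer:
    """Append-only log that agents write to (in-memory stand-in)."""

    def __init__(self) -> None:
        self._log: List[Message] = []

    def append(self, message: Message) -> None:
        self._log.append(message)

    def snapshot(self, start: int = 0) -> Sequence[Message]:
        # Return an immutable copy so callers cannot mutate the log.
        return tuple(self._log[start:])


class ReadOnlyTap:
    """Monitor-side view: exposes reads only, with no append or edit methods."""

    def __init__(self, buffer: ContextBuffer) -> None:
        self._buffer = buffer
        self._cursor = 0  # index of the next unread message

    def poll(self) -> Sequence[Message]:
        """Return messages appended since the previous poll."""
        new = self._buffer.snapshot(self._cursor)
        self._cursor += len(new)
        return new
```

Because the tap only ever reads snapshots, nothing the monitor does shows up in the buffer that agents see.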
 **Why this architecture:**
 | Property | Benefit |
 |----------|---------|
 | **Asymmetric visibility** | Monitor sees agents; agents can't see monitor |
 | **No message bus access** | Can't be poisoned via prompt injection |
 | **Separate trust domain** | Even compromised agents can't disable watchdog |
 | **Out-of-band control** | Stops flows via API, not messages |
 | **Cheap to run** | Mistral/Mixtral at ~$0.0002/1K tokens |
 **Alert thresholds (configurable):**
 | Condition | Default severity | Action |
 |-----------|---------|--------|
 | Same message pattern 3x | Warn | UI alert |
 | Same message pattern 5x | Critical | Auto-stop + email |
 | Token usage > 80% budget | Warn | UI alert |
 | Token usage > 95% budget | Critical | Auto-stop |
 | Suspected goal drift | Warn | Flag for review |
 | Escape attempt detected | Critical | Auto-stop + audit log |
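
As a rough illustration, the default thresholds in the table could be evaluated along these lines. The function name `evaluate`, the `Alert` type, and the input parameters are assumptions made for this sketch; how the monitor actually derives `repeat_count` or flags goal drift is not specified here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Alert:
    severity: str  # "warn" or "critical"
    action: str    # action column from the threshold table
    reason: str


def evaluate(repeat_count: int, tokens_used: int, token_budget: int,
             goal_drift_suspected: bool = False,
             escape_attempt: bool = False) -> List[Alert]:
    """Map observed conditions to alerts using the table's default thresholds."""
    alerts: List[Alert] = []

    # Repeated message pattern: 5x is critical, 3x is a warning.
    if repeat_count >= 5:
        alerts.append(Alert("critical", "auto-stop + email", "same message pattern 5x"))
    elif repeat_count >= 3:
        alerts.append(Alert("warn", "UI alert", "same message pattern 3x"))

    # Token budget: integer arithmetic avoids float rounding at the boundary.
    if token_budget > 0:
        if tokens_used * 100 > 95 * token_budget:
            alerts.append(Alert("critical", "auto-stop", "token usage > 95% budget"))
        elif tokens_used * 100 > 80 * token_budget:
            alerts.append(Alert("warn", "UI alert", "token usage > 80% budget"))

    if goal_drift_suspected:
        alerts.append(Alert("warn", "flag for review", "suspected goal drift"))
    if escape_attempt:
        alerts.append(Alert("critical", "auto-stop + audit log", "escape attempt detected"))
    return alerts
```

Since the thresholds are configurable, a real implementation would read these constants from flow or account settings instead of hard-coding them.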
 **Implementation:**
 - Monitor runs as sidecar process in pump container
 - Subscribes to Redis buffer updates (read-only)
 - Periodically summarizes recent messages and checks patterns
 - Alerts via control plane API (never via message bus)
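
One possible form of the pattern check for endless loops is sketched below. The normalization rules (lowercasing, masking digits, collapsing whitespace) and the window size are assumptions chosen for illustration, as are the names `normalize` and `max_repeat_count`; the sketch works on plain strings and leaves out the Redis subscription and the model-based summarization.

```python
import re
from collections import Counter
from typing import Iterable


def normalize(text: str) -> str:
    """Reduce a message to a coarse pattern so near-duplicates compare equal."""
    text = re.sub(r"\d+", "#", text.lower())  # mask counters, ids, timestamps
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace


def max_repeat_count(messages: Iterable[str], window: int = 20) -> int:
    """Return how often the most common pattern occurs in the last `window` messages."""
    recent = list(messages)[-window:]
    if not recent:
        return 0
    counts = Counter(normalize(m) for m in recent)
    return counts.most_common(1)[0][1]
```

The returned count can then be compared against the 3x and 5x defaults in the alert threshold table.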
 ---
 ## Data Flow Examples
@@ -965,6 +1046,7 @@ Good docs help humans AND train the AI — double value.
 - [ ] Webhook triggers
 - [ ] Execution history
 - [ ] Canvas ↔ YAML sync
 - [ ] Monitor AI (shadow watchdog)
 - [ ] Paid tier + Stripe billing
 ### Phase 3: Pro Features (4-6 weeks)
@@ -1018,6 +1100,8 @@ Good docs help humans AND train the AI — double value.
 | Code Editor | Monaco (TS mode) | No LSP server needed; asc compiler catches AS errors |
 | Flow Controls | Run/Stop only | No pause, no hot-edit; stateless flows, safe restarts |
 | AI Assistant | Self-hosted flow | Dogfooding: builder is a flow with catalog/validator tools |
 | Monitor AI | Shadow sidecar | Read-only watchdog; agents can't detect or influence |
 | Monitor Model | Mistral/Mixtral | Cheap (~$0.0002/1K); doesn't need frontier model |
 | Control Plane | FastAPI | Matches xml-pipeline, async-native |
 | Database | PostgreSQL | Render managed, reliable |
 | Cache/Pubsub | Redis | Already needed for xml-pipeline shared backend |