Add Monitor AI (shadow watchdog) to security model

Invisible AI watchdog for every flow:
- Read-only access to context buffer
- Cannot emit messages to pipeline
- Agents have no way to detect or probe it
- Alerts via control plane (email, UI, auto-stop)
- Runs on cheap models (Mistral/Mixtral)

Watches for: endless loops, goal drift, prompt injection,
sandbox escape attempts, token budget exhaustion.

Added to Phase 2 (core safety feature).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
dullfig 2026-01-22 20:54:51 -08:00
parent d682c677db
commit d184d22c60


@@ -595,6 +595,87 @@ Custom code runs in WASM, which provides:
└─────────────────────────────────────────────────────────────┘
```
### Monitor AI (Shadow Watchdog)
Every flow has an invisible AI watchdog that observes but cannot interfere.
Agents have no way to detect, probe, or influence the monitor.
```
┌─────────────────────────────────────────────────────────────────┐
│                           User's Flow                           │
│  ┌─────────┐    ┌─────────┐    ┌─────────┐                      │
│  │ Agent A │───▶│ Agent B │───▶│ Agent C │                      │
│  └─────────┘    └─────────┘    └─────────┘                      │
│       │              │              │                           │
│       └──────────────┼──────────────┘                           │
│                      ▼                                          │
│             ┌─────────────────┐                                 │
│             │ Context Buffer  │  (append-only log)              │
│             └────────┬────────┘                                 │
└──────────────────────│──────────────────────────────────────────┘
                       │ Read-only tap (invisible to agents)
┌─────────────────────────────────────────────────────────────────┐
│                       Monitor AI (Shadow)                       │
│                                                                 │
│  Capabilities:                                                  │
│  ├── READ: All messages in context buffer                       │
│  ├── ALERT: Email user, UI notification, webhook                │
│  └── STOP: Trigger flow stop via control plane                  │
│                                                                 │
│  Cannot:                                                        │
│  ├── Emit messages to pipeline                                  │
│  ├── Modify context buffer                                      │
│  └── Be detected by agents (no probe surface)                   │
│                                                                 │
│  Watches for:                                                   │
│  ├── Endless loops (repeating message patterns)                 │
│  ├── Token budget approaching limit                             │
│  ├── Goal drift (agent pursuing unintended objectives)          │
│  ├── Prompt injection attempts                                  │
│  ├── Sandbox escape attempts (probing for filesystem, etc.)     │
│  └── Anomalous conversation patterns                            │
│                                                                 │
│  Runs on: Cheap models (Mistral, Mixtral, Llama)                │
└─────────────────────────────────────────────────────────────────┘
                       ▼ Out-of-band alerts (not via message bus)
              ┌─────────────────┐
              │  Control Plane  │
              │  • Email user   │
              │  • UI alert     │
              │  • Auto-stop    │
              │  • Audit log    │
              └─────────────────┘
```
**Why this architecture:**
| Property | Benefit |
|----------|---------|
| **Asymmetric visibility** | Monitor sees agents; agents can't see monitor |
| **No message bus access** | Agents have no channel to address the monitor, which narrows the prompt-injection surface to the buffer contents it reads |
| **Separate trust domain** | Even compromised agents can't disable watchdog |
| **Out-of-band control** | Stops flows via API, not messages |
| **Cheap to run** | Mistral/Mixtral at ~$0.0002/1K tokens |
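
To put the "cheap to run" row in concrete terms, here is a rough back-of-the-envelope sketch. It assumes the ~$0.0002/1K-token figure from the table and a flat per-token price; the function name and the 500K-token workload are hypothetical, and output tokens or repeated re-reads of the buffer are ignored.

```python
# Assumed price, taken from the table above: ~$0.0002 per 1K tokens.
PRICE_PER_1K_TOKENS_USD = 0.0002


def monitor_cost_usd(tokens_reviewed: int,
                     price_per_1k: float = PRICE_PER_1K_TOKENS_USD) -> float:
    """Estimate the cost of the monitor reading `tokens_reviewed` tokens once."""
    return tokens_reviewed / 1000 * price_per_1k


# Hypothetical workload: a flow whose context buffer grows to 500K tokens.
print(f"${monitor_cost_usd(500_000):.2f}")
```

If the monitor re-summarizes overlapping windows, the real figure scales with how often each message is re-read.
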
**Alert thresholds (configurable):**
| Condition | Default severity | Action |
|-----------|------------------|--------|
| Same message pattern repeated 3 times | Warn | UI alert |
| Same message pattern repeated 5 times | Critical | Auto-stop + email |
| Token usage > 80% of budget | Warn | UI alert |
| Token usage > 95% of budget | Critical | Auto-stop |
| Suspected goal drift | Warn | Flag for review |
| Escape attempt detected | Critical | Auto-stop + audit log |
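
The two quantitative rows of the table (repeated patterns and token budget) could be evaluated along these lines. This is a minimal sketch, not the actual implementation: the `Thresholds` class, its field names, and the `evaluate` function are hypothetical, with defaults copied from the table.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Thresholds:
    """Defaults copied from the table above; the field names are hypothetical."""
    repeat_warn: int = 3
    repeat_critical: int = 5
    budget_warn: float = 0.80
    budget_critical: float = 0.95


def evaluate(repeats: int, tokens_used: int, token_budget: int,
             t: Thresholds = Thresholds()) -> List[Tuple[str, str]]:
    """Return (severity, action) pairs for the loop and token-budget rows."""
    alerts: List[Tuple[str, str]] = []

    # Check the critical level first so a flow never gets both warn and critical.
    if repeats >= t.repeat_critical:
        alerts.append(("critical", "auto-stop + email"))
    elif repeats >= t.repeat_warn:
        alerts.append(("warn", "ui-alert"))

    usage = tokens_used / token_budget if token_budget > 0 else 0.0
    if usage > t.budget_critical:
        alerts.append(("critical", "auto-stop"))
    elif usage > t.budget_warn:
        alerts.append(("warn", "ui-alert"))
    return alerts


print(evaluate(repeats=3, tokens_used=850, token_budget=1000))
```

Goal drift and escape attempts are judgment calls made by the model rather than numeric thresholds, so they are left out of this sketch.
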
**Implementation:**
- Monitor runs as a sidecar process in the pump container
- Subscribes to Redis buffer updates (read-only)
- Periodically summarizes recent messages and checks them for the patterns listed above
- Sends alerts through the control plane API (never via the message bus)
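
One check cycle for the endless-loop case might look like the sketch below. It is illustrative only: `read_recent` stands in for the read-only Redis subscription and `send_alert` for the control plane API, and both names (along with `normalize`, `max_repeat`, and `check_once`) are assumptions, not part of the codebase. Exact-match counting is also a stand-in for whatever pattern detection the monitor model actually performs.

```python
from collections import Counter
from typing import Callable, Iterable, List


def normalize(message: str) -> str:
    """Collapse case and whitespace so trivially different repeats match."""
    return " ".join(message.lower().split())


def max_repeat(messages: Iterable[str]) -> int:
    """Return how often the most frequent normalized message occurs."""
    counts = Counter(normalize(m) for m in messages)
    return max(counts.values(), default=0)


def check_once(read_recent: Callable[[], List[str]],
               send_alert: Callable[[str, str], None],
               warn_at: int = 3, stop_at: int = 5) -> None:
    """Run one monitoring pass.

    The monitor is handed only a reader and an out-of-band alert function,
    so it has no way to write to the buffer or the message bus.
    """
    repeats = max_repeat(read_recent())
    if repeats >= stop_at:
        send_alert("critical", f"message repeated {repeats}x; requesting auto-stop")
    elif repeats >= warn_at:
        send_alert("warn", f"message repeated {repeats}x")


# Usage with in-memory stand-ins for the buffer tap and the control plane.
sent = []
check_once(lambda: ["Retry step 2"] * 5 + ["done"],
           lambda severity, text: sent.append((severity, text)))
print(sent)
```

Passing the reader and the alert function in as parameters keeps the sketch testable without Redis and mirrors the capability split in the diagram: read access in, alerts out, nothing else.
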
---
## Data Flow Examples
@@ -965,6 +1046,7 @@ Good docs help humans AND train the AI — double value.
- [ ] Webhook triggers
- [ ] Execution history
- [ ] Canvas ↔ YAML sync
- [ ] Monitor AI (shadow watchdog)
- [ ] Paid tier + Stripe billing
### Phase 3: Pro Features (4-6 weeks)
@@ -1018,6 +1100,8 @@ Good docs help humans AND train the AI — double value.
| Code Editor | Monaco (TS mode) | No LSP server needed; asc compiler catches AS errors |
| Flow Controls | Run/Stop only | No pause, no hot-edit; stateless flows, safe restarts |
| AI Assistant | Self-hosted flow | Dogfooding: builder is a flow with catalog/validator tools |
| Monitor AI | Shadow sidecar | Read-only watchdog; agents can't detect or influence it |
| Monitor Model | Mistral/Mixtral | Cheap (~$0.0002/1K tokens); doesn't need a frontier model |
| Control Plane | FastAPI | Matches xml-pipeline, async-native |
| Database | PostgreSQL | Render managed, reliable |
| Cache/Pubsub | Redis | Already needed for xml-pipeline shared backend |