# Edge Analysis API Specification
**Status:** Draft
**Author:** Donna (with Dan)
**Date:** 2026-01-26
## Overview
When users connect nodes in the visual flow editor, an AI analyzes the schema compatibility and proposes field mappings. This provides immediate visual feedback (green/yellow/red) and reduces manual configuration for common cases.
## User Experience
### Visual Feedback
When a user draws a connection from Node A to Node B:
```
┌─────────────┐                             ┌─────────────┐
│   Node A    │ 🟢 ────────────────▶        │   Node B    │
│             │    High confidence          │             │
└─────────────┘                             └─────────────┘

┌─────────────┐                             ┌─────────────┐
│   Node C    │ 🟡 ─ ─ ─ ─ ─ ─ ─ ─▶         │   Node D    │
│             │    Review suggested         │             │
└─────────────┘                             └─────────────┘

┌─────────────┐                             ┌─────────────┐
│   Node E    │ 🔴 ─ ─ ─ ✕ ─ ─ ─ ─▶         │   Node F    │
│             │    Manual mapping needed    │             │
└─────────────┘                             └─────────────┘
```
### Confidence Levels
| Level | Color | Threshold | Meaning |
|-------|-------|-----------|---------|
| HIGH | 🟢 Green | ≥ 0.8 | All required inputs mapped, types compatible |
| MEDIUM | 🟡 Yellow | ≥ 0.4 and < 0.8 | Some mappings uncertain or missing optional fields |
| LOW | 🔴 Red | < 0.4 | Cannot determine mapping, manual intervention required |
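A minimal sketch of how a score could be bucketed into the three levels. The function name `confidence_level` is hypothetical, and the medium band is assumed to cover everything from 0.4 up to (but not including) 0.8 so that no score falls between bands.

```python
def confidence_level(score: float) -> str:
    """Map a 0-1 confidence score to the level names used in the table above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"confidence must be in [0, 1], got {score}")
    if score >= 0.8:
        return "high"    # green
    if score >= 0.4:
        return "medium"  # yellow
    return "low"         # red
```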
### Interaction Flow
1. User drags connection from A output → B input
2. Frontend calls `POST /api/v1/flows/{id}/analyze-edge`
3. Backend analyzes schemas (cached if previously computed)
4. Frontend renders line with confidence color
5. User clicks line → mapping panel opens
6. User can accept, modify, or manually define mappings
7. Changes saved to flow's `canvas_state`
## API Endpoint
### POST /api/v1/flows/{flow_id}/analyze-edge
Analyze compatibility between two nodes and propose field mappings.
#### Request
```json
{
  "from_node": "calculator.add",
  "to_node": "calculator.multiply",
  "from_output_schema": "<optional: override if not using node's default>",
  "to_input_schema": "<optional: override if not using node's default>"
}
```
**Notes:**
- If schemas are not provided, they are fetched from the node registry
- Schemas can be XSD strings or references to registered schemas
#### Response
```json
{
  "edge_id": "add_to_multiply",
  "confidence": 0.85,
  "level": "high",
  "proposed_mapping": {
    "mappings": [
      {
        "from_field": "output.sum",
        "to_field": "input.value",
        "confidence": 0.95,
        "reason": "Exact name match, compatible types (int → int)"
      },
      {
        "from_field": "output.operands",
        "to_field": "input.factors",
        "confidence": 0.6,
        "reason": "Semantic similarity, both arrays of numbers"
      }
    ],
    "unmapped_required": [
      {
        "field": "input.precision",
        "type": "int",
        "default": 2,
        "suggestion": "Set constant value or map from upstream"
      }
    ],
    "unmapped_optional": [
      {
        "field": "input.label",
        "type": "string"
      }
    ]
  },
  "warnings": [
    "input.precision has no source, using default value 2",
    "output.metadata will be discarded (no matching input field)"
  ],
  "errors": [],
  "analysis_method": "llm",
  "cached": false,
  "analysis_time_ms": 245
}
```
**Notes:**
- `analysis_method` is `"llm"`, or `"heuristic"` for simple cases
#### Error Response
```json
{
  "error": "schema_not_found",
  "message": "Node 'calculator.add' has no registered output schema",
  "details": {
    "node": "calculator.add"
  }
}
```
### GET /api/v1/flows/{flow_id}/edges
List all edges in a flow with their current mapping status.
#### Response
```json
{
  "edges": [
    {
      "id": "edge_1",
      "from_node": "input",
      "to_node": "calculator.add",
      "confidence": 0.92,
      "level": "high",
      "mapping_status": "auto",
      "last_analyzed": "2026-01-26T06:30:00Z"
    },
    {
      "id": "edge_2",
      "from_node": "calculator.add",
      "to_node": "formatter",
      "confidence": 0.45,
      "level": "medium",
      "mapping_status": "user_modified",
      "last_analyzed": "2026-01-26T06:30:00Z"
    }
  ]
}
```
### PUT /api/v1/flows/{flow_id}/edges/{edge_id}/mapping
Save user-defined or modified mapping for an edge.
#### Request
```json
{
  "mappings": [
    {
      "from_field": "output.sum",
      "to_field": "input.value"
    },
    {
      "to_field": "input.factor",
      "constant": 5
    },
    {
      "to_field": "input.label",
      "expression": "concat('Result: ', output.sum)"
    }
  ],
  "user_confirmed": true
}
```
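The request above shows three ways to supply a target value: `from_field`, `constant`, or `expression`. One way a backend might check the payload before saving it is sketched below. The rule that each entry carries exactly one of these keys is an assumption drawn from the example, not something the spec states, and `validate_mappings` is a hypothetical name.

```python
from typing import Any, Dict, List

# Keys that can supply a value for a target field, per the example request.
SOURCE_KEYS = ("from_field", "constant", "expression")


def validate_mappings(mappings: List[Dict[str, Any]]) -> List[str]:
    """Return a list of problems; an empty list means the payload looks well-formed."""
    problems = []
    for i, entry in enumerate(mappings):
        if not entry.get("to_field"):
            problems.append(f"mappings[{i}]: missing to_field")
        present = [k for k in SOURCE_KEYS if k in entry]
        if len(present) != 1:
            # Assumption: exactly one source per target field.
            problems.append(
                f"mappings[{i}]: expected exactly one of {SOURCE_KEYS}, found {present}"
            )
    return problems
```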
## Analysis Engine
### Heuristic Analysis (Fast Path)
Used when schemas are simple and mapping is obvious:
1. **Exact name match**: `output.value` → `input.value` (confidence: 0.95)
2. **Case-insensitive match**: `output.Value` → `input.value` (confidence: 0.9)
3. **Common aliases**: `output.result` → `input.value` (confidence: 0.7)
4. **Type compatibility**: int → float OK, string → int NOT OK
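The four rules above can be sketched as a single scoring function. This is an illustration under assumptions: field names are compared without their `output.`/`input.` prefix, types are plain strings, and the `ALIASES` and `COMPATIBLE` tables are hypothetical (the spec only gives `result`/`value` and int/float/string as examples).

```python
from typing import Optional

# Hypothetical alias groups; the spec only shows `result` -> `value`.
ALIASES = [{"value", "result"}]

# Type pairs treated as safe. int -> float is allowed, string -> int is not.
COMPATIBLE = {
    ("int", "int"),
    ("float", "float"),
    ("string", "string"),
    ("int", "float"),
}


def match_confidence(
    src_name: str, src_type: str, dst_name: str, dst_type: str
) -> Optional[float]:
    """Return a confidence for mapping src -> dst, or None if no rule applies."""
    if (src_type, dst_type) not in COMPATIBLE:
        return None  # rule 4: incompatible types are never mapped
    if src_name == dst_name:
        return 0.95  # rule 1: exact name match
    src_l, dst_l = src_name.lower(), dst_name.lower()
    if src_l == dst_l:
        return 0.9  # rule 2: case-insensitive match
    if any(src_l in group and dst_l in group for group in ALIASES):
        return 0.7  # rule 3: common alias
    return None
```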
### LLM Analysis (Deep Path)
Used when heuristics produce low confidence:
```
System: You are analyzing data flow compatibility between two XML schemas.
Given:
- Source schema (output of previous step): {from_schema}
- Target schema (input of next step): {to_schema}
Propose a field mapping. For each target field, identify:
1. The best source field to map from (if any)
2. Confidence (0-1) in the mapping
3. Brief reason for the mapping
If a required target field cannot be mapped, flag it.
If source fields will be discarded, note them.
Respond in JSON format.
```
### Caching Strategy
- Cache key: `hash(from_schema) + hash(to_schema)`
- TTL: 24 hours (schemas rarely change)
- Invalidate on: node schema update, manual cache clear by the user
- Store: Redis or in-memory LRU
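A minimal in-memory sketch of the cache described above. It assumes SHA-256 as the hash and joins the two digests with `:` so that swapping source and target yields a different key. `AnalysisCache` is a hypothetical stand-in for Redis and implements only the TTL, not LRU eviction.

```python
import hashlib
import time
from typing import Any, Dict, Optional, Tuple

TTL_SECONDS = 24 * 60 * 60  # 24-hour TTL from the spec


def cache_key(from_schema: str, to_schema: str) -> str:
    """Join per-schema SHA-256 digests; order matters, so (A, B) != (B, A)."""
    h_from = hashlib.sha256(from_schema.encode("utf-8")).hexdigest()
    h_to = hashlib.sha256(to_schema.encode("utf-8")).hexdigest()
    return f"{h_from}:{h_to}"


class AnalysisCache:
    """In-memory stand-in for Redis; entries expire after TTL_SECONDS."""

    def __init__(self) -> None:
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def put(self, key: str, value: Any, now: Optional[float] = None) -> None:
        stored_at = time.time() if now is None else now
        self._entries[key] = (stored_at, value)

    def get(self, key: str, now: Optional[float] = None) -> Optional[Any]:
        now = time.time() if now is None else now
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if now - stored_at >= TTL_SECONDS:
            del self._entries[key]  # expired entry is dropped on read
            return None
        return value

    def invalidate(self, key: str) -> None:
        """Drop an entry, e.g. after a node schema update."""
        self._entries.pop(key, None)
```

The optional `now` argument exists only to make expiry testable without waiting.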
## Database Schema
### Edge Mappings Table
```sql
CREATE TABLE edge_mappings (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    flow_id UUID NOT NULL REFERENCES flows(id) ON DELETE CASCADE,

    -- Edge identification
    from_node VARCHAR(100) NOT NULL,
    to_node VARCHAR(100) NOT NULL,

    -- Analysis results
    confidence NUMERIC(3,2),
    level VARCHAR(10),            -- 'high', 'medium', 'low'
    analysis_method VARCHAR(20),  -- 'heuristic', 'llm'

    -- The actual mapping (JSON)
    proposed_mapping JSONB,
    user_mapping JSONB,           -- User overrides, if any

    -- Status
    user_confirmed BOOLEAN DEFAULT FALSE,

    -- Timestamps
    analyzed_at TIMESTAMP WITH TIME ZONE,
    confirmed_at TIMESTAMP WITH TIME ZONE,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),

    UNIQUE(flow_id, from_node, to_node)
);

CREATE INDEX idx_edge_mappings_flow ON edge_mappings(flow_id);
```
## Sequencer Integration
When a sequence is executed, the sequencer factory:
1. Loads all edge mappings for the flow
2. For each edge, generates a transformer function:
```python
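from typing import Callable

# parse_xml, extract, evaluate, and serialize_xml are helpers assumed to be
# provided elsewhere in the pipeline; they are not defined in this spec.


def generate_transformer(edge_mapping: "EdgeMapping") -> Callable[[str], str]:
    """
    Generate a function that transforms A's output to B's input.
    """
    def transform(source_xml: str) -> str:
        source = parse_xml(source_xml)
        target = {}
        for mapping in edge_mapping.effective_mapping:
            if mapping.from_field:
                # Copy the value from the mapped source field
                target[mapping.to_field] = extract(source, mapping.from_field)
            elif mapping.constant is not None:
                # Fill the target with a literal value
                target[mapping.to_field] = mapping.constant
            elif mapping.expression:
                # Compute the target from an expression over the source
                target[mapping.to_field] = evaluate(mapping.expression, source)
        return serialize_xml(target, edge_mapping.to_schema)
    return transform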
```
3. Transformer is called between each step in the sequence
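The transformer reads `edge_mapping.effective_mapping`, which this spec does not define. One plausible reading, sketched here as an assumption, is that a saved `user_mapping` fully replaces the `proposed_mapping`. The `FieldMapping` and `EdgeMapping` classes below are hypothetical and cover only the attributes the transformer touches.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class FieldMapping:
    """One mapping entry; one of the three sources is expected to be set."""
    to_field: str
    from_field: Optional[str] = None
    constant: Any = None
    expression: Optional[str] = None


@dataclass
class EdgeMapping:
    """Illustrative subset of an edge_mappings row."""
    to_schema: str
    proposed_mapping: Optional[List[FieldMapping]] = None
    user_mapping: Optional[List[FieldMapping]] = None

    @property
    def effective_mapping(self) -> List[FieldMapping]:
        # Assumption: a user mapping, when present, fully replaces the proposal.
        if self.user_mapping is not None:
            return self.user_mapping
        return self.proposed_mapping or []
```

A merge of the two mappings per target field would be an equally valid design; full replacement is chosen here only because it is the simplest to reason about.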
## LLM-Created Sequences (Runtime)
Agents can dynamically create sequences at runtime. The sequencer factory runs the same analysis but with stricter rules.
### Runtime Policy
| Confidence | Design-time (Canvas) | Run-time (LLM) |
|------------|---------------------|----------------|
| 🟢 High | Auto-wire, run | Auto-wire, run |
| 🟡 Medium | Show warning, let user decide | **Block**, request clarification |
| 🔴 Low | Show error, require manual | **Block**, request clarification |
**Rationale:** Letting LLMs YOLO on uncertain mappings is risky. Better to ask for explicit instructions.
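The policy table can be read as a small decision function. The sketch below is illustrative: the function name and the returned action strings are hypothetical labels for the table cells, not identifiers from the implementation.

```python
def wiring_action(level: str, runtime: bool) -> str:
    """Return the action for an edge, given its level and who built the sequence.

    runtime=False is the design-time canvas; runtime=True is an LLM-created sequence.
    """
    if level not in ("high", "medium", "low"):
        raise ValueError(f"unknown level: {level!r}")
    if level == "high":
        return "auto_wire"  # green runs in both modes
    if runtime:
        return "block_and_clarify"  # green-only policy at run-time
    return "warn_user" if level == "medium" else "require_manual"
```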
### LLM Clarification Flow
```
1. LLM requests sequence: A → B → C
2. Factory analyzes:
   - A → B: green ✓
   - B → C: yellow (ambiguous)
3. Factory responds with structured error:
   - Which edge failed
   - What B outputs
   - What C expects
   - Suggested resolutions
4. LLM provides hints:
   - Explicit mappings
   - Constants for missing fields
   - Fields to drop
5. Factory re-analyzes with hints
6. If green: run. If not: back to step 3.
```
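The flow above is a retry loop. The sketch below shows one way it could be driven, with the analyzer and the LLM round-trip passed in as callables. Both callables, the `wire_sequence` name, and the `max_rounds` bound are assumptions; the spec itself places no limit on how many times the loop may return to step 3.

```python
from typing import Callable, Dict, List, Tuple

Edge = Tuple[str, str]


def wire_sequence(
    steps: List[str],
    analyze: Callable[[str, str, List[str]], str],
    request_hints: Callable[[Edge], List[str]],
    max_rounds: int = 3,
) -> bool:
    """Return True once every edge is green, False if the rounds run out."""
    edges = list(zip(steps, steps[1:]))
    hints: Dict[Edge, List[str]] = {}
    for round_no in range(max_rounds + 1):
        # Steps 2 and 5: analyze every edge with whatever hints exist so far.
        failing = [
            e for e in edges if analyze(e[0], e[1], hints.get(e, [])) != "high"
        ]
        if not failing:
            return True  # step 6: all green, safe to run
        if round_no == max_rounds:
            break  # assumed safety bound, not part of the spec
        for edge in failing:
            # Steps 3 and 4: report the failing edge, collect the LLM's hints.
            hints.setdefault(edge, []).extend(request_hints(edge))
    return False
```

Bounding the loop avoids an endless exchange when the LLM cannot supply a usable hint; at that point asking the user for help is the remaining option.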
### Edge Hints Payload
LLMs can provide explicit wiring instructions:
```xml
<CreateSequence>
  <steps>transformer, uploader</steps>
  <initial_payload>...</initial_payload>
  <edge_hints>
    <edge from="transformer" to="uploader">
      <map from="result" to="payload"/>
      <constant field="destination">/uploads/output.json</constant>
      <drop field="factor"/>
      <drop field="metadata"/>
    </edge>
  </edge_hints>
</CreateSequence>
```
### Hint Types
| Hint | Syntax | Effect |
|------|--------|--------|
| Map field | `<map from="X" to="Y"/>` | Wire source field to target field |
| Set constant | `<constant field="Y">value</constant>` | Set target field to literal value |
| Drop field | `<drop field="X"/>` | Explicitly ignore source field |
| Expression | `<expr field="Y">concat(X.a, X.b)</expr>` | Compute target from expression |
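To show how these hints might be consumed, the sketch below flattens a `<CreateSequence>` payload into a list of dictionaries using the standard library's `xml.etree.ElementTree`. The function name and the output shape (`edge_from`, `edge_to`, `kind`, `value`) are hypothetical choices made for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def parse_edge_hints(xml_text: str) -> List[Dict[str, str]]:
    """Flatten every hint under <edge_hints> into a list of dicts."""
    root = ET.fromstring(xml_text)
    hints: List[Dict[str, str]] = []
    for edge in root.findall("./edge_hints/edge"):
        for child in edge:
            hint = {
                "edge_from": edge.get("from", ""),
                "edge_to": edge.get("to", ""),
                "kind": child.tag,  # map, constant, drop, or expr
            }
            # Attributes: from/to for <map>, field for the other hint types.
            hint.update(child.attrib)
            if child.tag in ("constant", "expr"):
                # The literal or expression is carried as element text.
                hint["value"] = (child.text or "").strip()
            hints.append(hint)
    return hints
```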
### Error Response to LLM
When wiring fails:
```xml
<SequenceError>
  <code>wiring_failed</code>
  <edge from="transformer" to="uploader"/>
  <source_fields>
    <field name="result" type="string"/>
    <field name="factor" type="int"/>
    <field name="metadata" type="object"/>
  </source_fields>
  <target_fields>
    <field name="payload" type="string" required="true" mapped="result" confidence="0.85"/>
    <field name="destination" type="string" required="true" mapped="" confidence="0"/>
  </target_fields>
  <issues>
    <issue type="unmapped_required">destination has no source</issue>
    <issue type="unmapped_source">factor will be dropped</issue>
    <issue type="unmapped_source">metadata will be dropped</issue>
  </issues>
  <suggestion>Provide mapping for 'destination' or set as constant.</suggestion>
</SequenceError>
```
This gives the LLM enough information to either:
- Provide the missing hints
- Try a different sequence
- Ask the user for help
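As an illustration of the first option, an agent-side helper could read the error and list the required target fields that still need a hint. This sketch assumes the element and attribute names shown in the example above; `unmapped_required_fields` is a hypothetical name.

```python
import xml.etree.ElementTree as ET
from typing import List


def unmapped_required_fields(error_xml: str) -> List[str]:
    """Names of required target fields whose `mapped` attribute is empty or absent."""
    root = ET.fromstring(error_xml)
    return [
        field.get("name", "")
        for field in root.findall("./target_fields/field")
        if field.get("required") == "true" and not field.get("mapped")
    ]
```

For the example error, this yields only `destination`, which matches the `<suggestion>` text.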
## Future Enhancements
### v1.1 — Type Coercion
- Automatic int ↔ string, date formatting, etc.
- Warnings when lossy conversion occurs
### v1.2 — Expression Builder
- Visual expression editor for complex mappings
- Functions: `concat()`, `format()`, `split()`, `lookup()`
### v1.3 — Learning from Corrections
- Track when users override AI suggestions
- Fine-tune confidence thresholds
- Eventually: personalized mapping suggestions
### v2.0 — Multi-Output Nodes
- Some nodes produce multiple outputs
- UI shows multiple output ports
- User wires specific port to specific input
---
*This spec is a living document. Update as implementation progresses.*