xml-pipeline/docs/edge-analysis-spec.md
Donna 4aa40ed29b Add edge analysis API spec - AI-assisted field mapping
- POST /api/v1/flows/{id}/analyze-edge endpoint spec
- Confidence levels: high (green), medium (yellow), low (red)
- Heuristic + LLM analysis paths
- Database schema for edge_mappings
- Sequencer integration notes
- Future enhancements roadmap

Co-authored-by: Dan
2026-01-26 06:46:17 +00:00


# Edge Analysis API Specification
**Status:** Draft
**Author:** Donna (with Dan)
**Date:** 2026-01-26
## Overview
When users connect nodes in the visual flow editor, an AI analyzes the schema compatibility and proposes field mappings. This provides immediate visual feedback (green/yellow/red) and reduces manual configuration for common cases.
## User Experience
### Visual Feedback
When a user draws a connection from Node A to Node B:
```
┌─────────────┐                          ┌─────────────┐
│   Node A    │ 🟢 ────────────────▶     │   Node B    │
│             │    High confidence       │             │
└─────────────┘                          └─────────────┘

┌─────────────┐                          ┌─────────────┐
│   Node C    │ 🟡 ─ ─ ─ ─ ─ ─ ─ ─▶      │   Node D    │
│             │    Review suggested      │             │
└─────────────┘                          └─────────────┘

┌─────────────┐                          ┌─────────────┐
│   Node E    │ 🔴 ─ ─ ─ ✕ ─ ─ ─ ─▶      │   Node F    │
│             │    Manual mapping needed │             │
└─────────────┘                          └─────────────┘
```
### Confidence Levels
| Level | Color | Threshold | Meaning |
|-------|-------|-----------|---------|
| HIGH | 🟢 Green | ≥ 0.8 | All required inputs mapped, types compatible |
| MEDIUM | 🟡 Yellow | ≥ 0.4 and < 0.8 | Some mappings uncertain or missing optional fields |
| LOW | 🔴 Red | < 0.4 | Cannot determine mapping, manual intervention required |
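
The thresholds in the table can be expressed as a small classifier. This is a minimal sketch; the function name `confidence_level` is hypothetical, and the lowercase level strings follow the `level` values shown in the API responses.

```python
def confidence_level(score: float) -> str:
    """Map a 0-1 confidence score to the level names used in this spec."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"confidence must be in [0, 1], got {score}")
    if score >= 0.8:
        return "high"    # green
    if score >= 0.4:
        return "medium"  # yellow
    return "low"         # red
```

Comparing against the lower bound of each band avoids a gap between 0.79 and 0.8 for scores that carry more than two decimal places.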
### Interaction Flow
1. User drags connection from A output → B input
2. Frontend calls `POST /api/v1/flows/{id}/analyze-edge`
3. Backend analyzes schemas (cached if previously computed)
4. Frontend renders line with confidence color
5. User clicks line → mapping panel opens
6. User can accept, modify, or manually define mappings
7. Changes saved to flow's `canvas_state`
## API Endpoint
### POST /api/v1/flows/{flow_id}/analyze-edge
Analyze compatibility between two nodes and propose field mappings.
#### Request
```json
{
  "from_node": "calculator.add",
  "to_node": "calculator.multiply",
  "from_output_schema": "<optional: override if not using node's default>",
  "to_input_schema": "<optional: override if not using node's default>"
}
```
**Notes:**
- If schemas are not provided, they are fetched from the node registry
- Schemas can be XSD strings or references to registered schemas
#### Response
```json
{
  "edge_id": "add_to_multiply",
  "confidence": 0.85,
  "level": "high",
  "proposed_mapping": {
    "mappings": [
      {
        "from_field": "output.sum",
        "to_field": "input.value",
        "confidence": 0.95,
        "reason": "Exact name match, compatible types (int → int)"
      },
      {
        "from_field": "output.operands",
        "to_field": "input.factors",
        "confidence": 0.6,
        "reason": "Semantic similarity, both arrays of numbers"
      }
    ],
    "unmapped_required": [
      {
        "field": "input.precision",
        "type": "int",
        "default": 2,
        "suggestion": "Set constant value or map from upstream"
      }
    ],
    "unmapped_optional": [
      {
        "field": "input.label",
        "type": "string"
      }
    ]
  },
  "warnings": [
    "input.precision has no source, using default value 2",
    "output.metadata will be discarded (no matching input field)"
  ],
  "errors": [],
  "analysis_method": "llm",
  "cached": false,
  "analysis_time_ms": 245
}
```
**Notes:**
- `analysis_method` is `"llm"`, or `"heuristic"` for simple cases
#### Error Response
```json
{
  "error": "schema_not_found",
  "message": "Node 'calculator.add' has no registered output schema",
  "details": {
    "node": "calculator.add"
  }
}
```
### GET /api/v1/flows/{flow_id}/edges
List all edges in a flow with their current mapping status.
#### Response
```json
{
  "edges": [
    {
      "id": "edge_1",
      "from_node": "input",
      "to_node": "calculator.add",
      "confidence": 0.92,
      "level": "high",
      "mapping_status": "auto",
      "last_analyzed": "2026-01-26T06:30:00Z"
    },
    {
      "id": "edge_2",
      "from_node": "calculator.add",
      "to_node": "formatter",
      "confidence": 0.45,
      "level": "medium",
      "mapping_status": "user_modified",
      "last_analyzed": "2026-01-26T06:30:00Z"
    }
  ]
}
```
### PUT /api/v1/flows/{flow_id}/edges/{edge_id}/mapping
Save user-defined or modified mapping for an edge.
#### Request
```json
{
  "mappings": [
    {
      "from_field": "output.sum",
      "to_field": "input.value"
    },
    {
      "to_field": "input.factor",
      "constant": 5
    },
    {
      "to_field": "input.label",
      "expression": "concat('Result: ', output.sum)"
    }
  ],
  "user_confirmed": true
}
```
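
Each mapping entry in the example above names a target field and one source: a `from_field`, a `constant`, or an `expression`. The sketch below checks that shape before saving. The "exactly one source" rule is an assumption inferred from the example, not something this spec states, and `validate_mapping_entry` is a hypothetical name.

```python
SOURCE_KEYS = ("from_field", "constant", "expression")


def validate_mapping_entry(entry: dict) -> list:
    """Return a list of problems for one mapping entry (empty if it looks valid)."""
    problems = []
    if not entry.get("to_field"):
        problems.append("missing to_field")
    # Assumed rule: an entry carries exactly one kind of source.
    present = [key for key in SOURCE_KEYS if key in entry]
    if len(present) != 1:
        problems.append(
            "expected exactly one of from_field/constant/expression, got "
            + (", ".join(present) or "none")
        )
    return problems
```

A handler for this endpoint could reject the request when any entry returns a non-empty list.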
## Analysis Engine
### Heuristic Analysis (Fast Path)
Used when schemas are simple and mapping is obvious:
1. **Exact name match** `output.value` → `input.value` (confidence: 0.95)
2. **Case-insensitive match** `output.Value` → `input.value` (confidence: 0.9)
3. **Common aliases** `output.result` → `input.value` (confidence: 0.7)
4. **Type compatibility** int → float OK, string → int NOT OK
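
The four rules above can be sketched as a single scoring function. This is illustrative only: the alias table and the type-compatibility table are assumptions that extend the single examples given in the list, and `heuristic_match` is a hypothetical name.

```python
from typing import Optional

# Hypothetical alias groups; the spec only gives result -> value as an example.
ALIASES = [{"value", "result"}]

# Source type -> target types it may flow into (assumed beyond int -> float).
COMPATIBLE = {
    "int": {"int", "float"},
    "float": {"float"},
    "string": {"string"},
}


def leaf(path: str) -> str:
    """Return the last segment of a dotted field path, e.g. 'output.sum' -> 'sum'."""
    return path.rsplit(".", 1)[-1]


def heuristic_match(from_field: str, from_type: str,
                    to_field: str, to_type: str) -> Optional[float]:
    """Return a confidence for mapping from_field to to_field, or None if no rule applies."""
    # Rule 4: incompatible types rule the pair out regardless of names.
    if to_type not in COMPATIBLE.get(from_type, set()):
        return None
    a, b = leaf(from_field), leaf(to_field)
    if a == b:
        return 0.95  # Rule 1: exact name match
    if a.lower() == b.lower():
        return 0.9   # Rule 2: case-insensitive match
    if any(a.lower() in group and b.lower() in group for group in ALIASES):
        return 0.7   # Rule 3: common alias
    return None
```

Pairs that return `None` for every candidate source field are the ones that would fall through to the LLM path.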
### LLM Analysis (Deep Path)
Used when heuristics produce low confidence:
```
System: You are analyzing data flow compatibility between two XML schemas.
Given:
- Source schema (output of previous step): {from_schema}
- Target schema (input of next step): {to_schema}
Propose a field mapping. For each target field, identify:
1. The best source field to map from (if any)
2. Confidence (0-1) in the mapping
3. Brief reason for the mapping
If a required target field cannot be mapped, flag it.
If source fields will be discarded, note them.
Respond in JSON format.
```
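
Because the prompt asks the model to respond in JSON, the reply should be validated before it is trusted. The sketch below assumes the model returns an object with a `mappings` list shaped like the API response; that shape is an assumption, since the spec does not fix the model's output format, and `parse_llm_mappings` is a hypothetical name.

```python
import json


def parse_llm_mappings(raw: str) -> list:
    """Parse a model reply into mapping dicts, dropping anything malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []  # Unparseable reply: treat as "no mappings proposed".
    if not isinstance(data, dict):
        return []
    items = data.get("mappings")
    if not isinstance(items, list):
        return []
    result = []
    for item in items:
        if not isinstance(item, dict) or "to_field" not in item:
            continue
        try:
            confidence = float(item.get("confidence", 0.0))
        except (TypeError, ValueError):
            confidence = 0.0
        # Clamp to the 0-1 range the prompt asks for.
        result.append({**item, "confidence": min(1.0, max(0.0, confidence))})
    return result
```

Returning an empty list on bad output lets the caller report a low-confidence edge instead of failing the request.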
### Caching Strategy
- Cache key: `hash(from_schema) + hash(to_schema)`
- TTL: 24 hours (schemas rarely change)
- Invalidate on: node schema update, manual cache clear by the user
- Store: Redis or in-memory LRU
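
As one way to picture the in-memory option, here is a small LRU cache with a TTL. It is a sketch under assumptions: SHA-256 stands in for the unspecified `hash`, the `max_entries` default is arbitrary, and the names `cache_key` and `AnalysisCache` are hypothetical.

```python
import hashlib
import time
from collections import OrderedDict

TTL_SECONDS = 24 * 60 * 60  # 24 hours, as listed above


def cache_key(from_schema: str, to_schema: str) -> str:
    """Combine both schema hashes; order matters, so A->B differs from B->A."""
    h_from = hashlib.sha256(from_schema.encode("utf-8")).hexdigest()
    h_to = hashlib.sha256(to_schema.encode("utf-8")).hexdigest()
    return f"{h_from}:{h_to}"


class AnalysisCache:
    """In-memory LRU cache whose entries expire after a fixed TTL."""

    def __init__(self, max_entries=1024, ttl=TTL_SECONDS, clock=time.monotonic):
        self._entries = OrderedDict()
        self._max_entries = max_entries
        self._ttl = ttl
        self._clock = clock  # Injectable so tests can control time.

    def get(self, key):
        item = self._entries.get(key)
        if item is None:
            return None
        expires_at, value = item
        if self._clock() >= expires_at:
            del self._entries[key]  # Expired: drop and report a miss.
            return None
        self._entries.move_to_end(key)  # Mark as most recently used.
        return value

    def put(self, key, value):
        self._entries[key] = (self._clock() + self._ttl, value)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # Evict least recently used.

    def clear(self):
        """Drop everything, e.g. on a manual cache clear."""
        self._entries.clear()
```

Because the key is derived from schema content, an updated schema produces a new key and the stale entry simply ages out.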
## Database Schema
### Edge Mappings Table
```sql
CREATE TABLE edge_mappings (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    flow_id UUID NOT NULL REFERENCES flows(id) ON DELETE CASCADE,

    -- Edge identification
    from_node VARCHAR(100) NOT NULL,
    to_node VARCHAR(100) NOT NULL,

    -- Analysis results
    confidence NUMERIC(3,2),
    level VARCHAR(10),            -- 'high', 'medium', 'low'
    analysis_method VARCHAR(20),  -- 'heuristic', 'llm'

    -- The actual mapping (JSON)
    proposed_mapping JSONB,
    user_mapping JSONB,           -- User overrides, if any

    -- Status
    user_confirmed BOOLEAN DEFAULT FALSE,

    -- Timestamps
    analyzed_at TIMESTAMP WITH TIME ZONE,
    confirmed_at TIMESTAMP WITH TIME ZONE,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),

    UNIQUE(flow_id, from_node, to_node)
);

CREATE INDEX idx_edge_mappings_flow ON edge_mappings(flow_id);
```
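
The table keeps the AI proposal and the user override side by side. One plausible reading, sketched below, is that a non-null `user_mapping` takes precedence and also drives the `mapping_status` value returned by the edges endpoint. Both rules are assumptions, and the function names are hypothetical; rows are represented as plain dicts keyed by column name.

```python
from typing import Optional


def effective_mapping(row: dict) -> Optional[dict]:
    """Pick the mapping to execute: the user override if present, else the proposal."""
    user = row.get("user_mapping")
    return user if user is not None else row.get("proposed_mapping")


def mapping_status(row: dict) -> str:
    """Derive the status string shown in GET /edges (assumed derivation)."""
    return "user_modified" if row.get("user_mapping") is not None else "auto"
```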
## Sequencer Integration
When a sequence is executed, the sequencer factory:
1. Loads all edge mappings for the flow
2. For each edge, generates a transformer function:
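```python
from typing import Callable

# parse_xml, extract, evaluate, and serialize_xml are helpers assumed to be
# provided by the sequencer; they are not defined in this spec.


def generate_transformer(edge_mapping: EdgeMapping) -> Callable[[str], str]:
    """
    Generate a function that transforms A's output to B's input.
    """
    def transform(source_xml: str) -> str:
        source = parse_xml(source_xml)
        target = {}
        for mapping in edge_mapping.effective_mapping:
            if mapping.from_field:
                # Copy a value from the source document.
                target[mapping.to_field] = extract(source, mapping.from_field)
            elif mapping.constant is not None:
                # Fixed value supplied by the user.
                target[mapping.to_field] = mapping.constant
            elif mapping.expression:
                # Computed value, e.g. concat('Result: ', output.sum).
                target[mapping.to_field] = evaluate(mapping.expression, source)
        return serialize_xml(target, edge_mapping.to_schema)
    return transform
```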
3. Transformer is called between each step in the sequence
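
To make the transformer concrete, here is a self-contained sketch that handles field and constant mappings with the standard library's `xml.etree.ElementTree`. It is not the sequencer's implementation: the `FieldMapping` class, the convention that the first path segment (`output`, `input`) names the document root, and the omission of expressions and schema validation are all assumptions.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class FieldMapping:
    to_field: str
    from_field: Optional[str] = None
    constant: Any = None


def extract(root: ET.Element, path: str) -> Optional[str]:
    """Read 'output.sum' as child <sum> of the root; the first segment is skipped."""
    node = root
    for part in path.split(".")[1:]:
        node = node.find(part)
        if node is None:
            return None
    return node.text


def set_field(root: ET.Element, path: str, value: Any) -> None:
    """Write 'input.value' as child <value> of the root, creating elements as needed."""
    node = root
    for part in path.split(".")[1:]:
        child = node.find(part)
        if child is None:
            child = ET.SubElement(node, part)
        node = child
    node.text = str(value)


def transform(source_xml: str, mappings: List[FieldMapping],
              target_root: str = "input") -> str:
    source = ET.fromstring(source_xml)
    target = ET.Element(target_root)
    for mapping in mappings:
        if mapping.from_field is not None:
            value = extract(source, mapping.from_field)
        else:
            value = mapping.constant
        if value is not None:  # Skip fields with no resolvable source.
            set_field(target, mapping.to_field, value)
    return ET.tostring(target, encoding="unicode")
```

For example, mapping `output.sum` to `input.value` and setting `input.factor` to the constant 5 turns `<output><sum>7</sum></output>` into `<input><value>7</value><factor>5</factor></input>`.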
## Future Enhancements
### v1.1 — Type Coercion
- Automatic int ↔ string, date formatting, etc.
- Warnings when lossy conversion occurs
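
A rough sketch of what coercion with lossy-conversion warnings might look like follows. Since this feature is not designed yet, everything here is hypothetical: the `coerce` name, the `(value, warning)` return shape, and the choice to truncate rather than round.

```python
from typing import Any, Optional, Tuple


def coerce(value: Any, target_type: str) -> Tuple[Any, Optional[str]]:
    """Convert value to target_type; return (converted, warning or None)."""
    if target_type == "string":
        return str(value), None
    if target_type == "int":
        if isinstance(value, int):
            return value, None
        number = float(value)    # Raises ValueError for non-numeric strings.
        converted = int(number)  # Truncates toward zero.
        if converted != number:
            return converted, f"lossy conversion: {value!r} -> {converted}"
        return converted, None
    raise ValueError(f"unsupported target type: {target_type}")
```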
### v1.2 — Expression Builder
- Visual expression editor for complex mappings
- Functions: `concat()`, `format()`, `split()`, `lookup()`
### v1.3 — Learning from Corrections
- Track when users override AI suggestions
- Fine-tune confidence thresholds
- Eventually: personalized mapping suggestions
### v2.0 — Multi-Output Nodes
- Some nodes produce multiple outputs
- UI shows multiple output ports
- User wires specific port to specific input
---
*This spec is a living document. Update as implementation progresses.*