xml-pipeline/docs/edge-analysis-spec.md


Edge Analysis API Specification

Status: Draft
Author: Donna (with Dan)
Date: 2026-01-26

Overview

When users connect nodes in the visual flow editor, an AI analyzes the schema compatibility and proposes field mappings. This provides immediate visual feedback (green/yellow/red) and reduces manual configuration for common cases.

User Experience

Visual Feedback

When a user draws a connection from Node A to Node B:

┌─────────────┐                           ┌─────────────┐
│   Node A    │    🟢 ────────────────▶   │   Node B    │
│             │    High confidence        │             │
└─────────────┘                           └─────────────┘

┌─────────────┐                           ┌─────────────┐
│   Node C    │    🟡 ─ ─ ─ ─ ─ ─ ─ ─▶   │   Node D    │
│             │    Review suggested       │             │
└─────────────┘                           └─────────────┘

┌─────────────┐                           ┌─────────────┐
│   Node E    │    🔴 ─ ─ ─ ✕ ─ ─ ─ ─▶   │   Node F    │
│             │    Manual mapping needed  │             │
└─────────────┘                           └─────────────┘

Confidence Levels

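| Level  | Color     | Threshold  | Meaning                                                |
|--------|-----------|------------|--------------------------------------------------------|
| HIGH   | 🟢 Green  | ≥ 0.8      | All required inputs mapped, types compatible           |
| MEDIUM | 🟡 Yellow | 0.4 - 0.79 | Some mappings uncertain or missing optional fields     |
| LOW    | 🔴 Red    | < 0.4      | Cannot determine mapping, manual intervention required |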
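
The thresholds above can be sketched as a small classifier. This is only an illustration: the function name is hypothetical, and it assumes the medium band means 0.4 ≤ confidence < 0.8 (the table's 0.4 - 0.79 matches scores rounded to two decimals).

```python
def level_for_confidence(confidence: float) -> str:
    """Map a 0-1 confidence score to 'high', 'medium', or 'low'."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence must be in [0, 1], got {confidence}")
    if confidence >= 0.8:
        return "high"    # green
    if confidence >= 0.4:
        return "medium"  # yellow
    return "low"         # red
```

Keeping the thresholds in one function means the canvas colors and the run-time policy cannot drift apart.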

Interaction Flow

  1. User drags connection from A output → B input
  2. Frontend calls POST /api/v1/flows/{id}/analyze-edge
  3. Backend analyzes schemas (cached if previously computed)
  4. Frontend renders line with confidence color
  5. User clicks line → mapping panel opens
  6. User can accept, modify, or manually define mappings
  7. Changes saved to flow's canvas_state

API Endpoint

POST /api/v1/flows/{flow_id}/analyze-edge

Analyze compatibility between two nodes and propose field mappings.

Request

{
  "from_node": "calculator.add",
  "to_node": "calculator.multiply",
  "from_output_schema": "<optional: override if not using node's default>",
  "to_input_schema": "<optional: override if not using node's default>"
}

Notes:

  • If schemas are not provided, they are fetched from the node registry
  • Schemas can be XSD strings or references to registered schemas

Response

{
  "edge_id": "add_to_multiply",
  "confidence": 0.85,
  "level": "high",
  
  "proposed_mapping": {
    "mappings": [
      {
        "from_field": "output.sum",
        "to_field": "input.value",
        "confidence": 0.95,
        "reason": "Exact name match, compatible types (int → int)"
      },
      {
        "from_field": "output.operands",
        "to_field": "input.factors",
        "confidence": 0.6,
        "reason": "Semantic similarity, both arrays of numbers"
      }
    ],
    "unmapped_required": [
      {
        "field": "input.precision",
        "type": "int",
        "default": 2,
        "suggestion": "Set constant value or map from upstream"
      }
    ],
    "unmapped_optional": [
      {
        "field": "input.label",
        "type": "string"
      }
    ]
  },
  
  "warnings": [
    "input.precision has no source, using default value 2",
    "output.metadata will be discarded (no matching input field)"
  ],
  
  "errors": [],
  
  "analysis_method": "llm",  // or "heuristic" for simple cases
  "cached": false,
  "analysis_time_ms": 245
}
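
As a rough illustration of how a client might consume this response, the sketch below lists the parts of a proposal that probably deserve a second look. The function name and the 0.8 per-mapping cutoff are assumptions, not part of the spec.

```python
def mappings_needing_review(response: dict, threshold: float = 0.8) -> list[str]:
    """Return human-readable labels for uncertain or missing mappings."""
    proposed = response.get("proposed_mapping", {})
    # Proposed mappings whose own confidence falls below the cutoff.
    flagged = [
        f'{m["from_field"]} -> {m["to_field"]}'
        for m in proposed.get("mappings", [])
        if m["confidence"] < threshold
    ]
    # Required target fields that have no source at all.
    flagged += [
        f'(unmapped) -> {u["field"]}'
        for u in proposed.get("unmapped_required", [])
    ]
    return flagged
```

With the example response above, this would flag the output.operands → input.factors mapping (confidence 0.6) and the unmapped input.precision field.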

Error Response

{
  "error": "schema_not_found",
  "message": "Node 'calculator.add' has no registered output schema",
  "details": {
    "node": "calculator.add"
  }
}

GET /api/v1/flows/{flow_id}/edges

List all edges in a flow with their current mapping status.

Response

{
  "edges": [
    {
      "id": "edge_1",
      "from_node": "input",
      "to_node": "calculator.add",
      "confidence": 0.92,
      "level": "high",
      "mapping_status": "auto",
      "last_analyzed": "2026-01-26T06:30:00Z"
    },
    {
      "id": "edge_2", 
      "from_node": "calculator.add",
      "to_node": "formatter",
      "confidence": 0.45,
      "level": "medium",
      "mapping_status": "user_modified",
      "last_analyzed": "2026-01-26T06:30:00Z"
    }
  ]
}

PUT /api/v1/flows/{flow_id}/edges/{edge_id}/mapping

Save user-defined or modified mapping for an edge.

Request

{
  "mappings": [
    {
      "from_field": "output.sum",
      "to_field": "input.value"
    },
    {
      "to_field": "input.factor",
      "constant": 5
    },
    {
      "to_field": "input.label",
      "expression": "concat('Result: ', output.sum)"
    }
  ],
  "user_confirmed": true
}
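
The request above shows three kinds of mapping entry: a source field, a constant, or an expression. A server-side check might look like the sketch below. It assumes each entry must name exactly one of these sources, which the spec implies but does not state, and the function name is hypothetical.

```python
SOURCE_KEYS = ("from_field", "constant", "expression")


def validate_mapping_entry(entry: dict) -> None:
    """Raise ValueError unless the entry has a target and exactly one source."""
    if not entry.get("to_field"):
        raise ValueError("mapping entry needs a non-empty 'to_field'")
    # Test key presence, not truthiness, so constants like 0 or "" are allowed.
    present = [key for key in SOURCE_KEYS if key in entry]
    if len(present) != 1:
        raise ValueError(
            f"entry for {entry['to_field']!r} must have exactly one of "
            f"{SOURCE_KEYS}, found {present}"
        )
```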

Analysis Engine

Heuristic Analysis (Fast Path)

Used when schemas are simple and mapping is obvious:

  1. Exact name match: output.value → input.value (confidence: 0.95)
  2. Case-insensitive match: output.Value → input.value (confidence: 0.9)
  3. Common aliases: output.result → input.value (confidence: 0.7)
  4. Type compatibility: int → float OK, string → int NOT OK
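
The four rules above might be combined roughly as follows. This sketch makes several assumptions: field names are compared without the output./input. prefix, the alias table contains only the one alias the spec mentions, and an incompatible type pair scores 0.0.

```python
# Hypothetical alias table: source name -> target name it commonly stands for.
ALIASES = {"result": "value"}
# Lossless widenings that are allowed besides identical types.
WIDENING = {("int", "float")}


def types_compatible(source_type: str, target_type: str) -> bool:
    return source_type == target_type or (source_type, target_type) in WIDENING


def match_confidence(src_name: str, src_type: str,
                     dst_name: str, dst_type: str) -> float:
    """Score one candidate source/target pair using the heuristic rules."""
    if not types_compatible(src_type, dst_type):
        return 0.0
    if src_name == dst_name:
        return 0.95  # exact name match
    if src_name.lower() == dst_name.lower():
        return 0.9   # case-insensitive match
    if ALIASES.get(src_name.lower()) == dst_name.lower():
        return 0.7   # common alias
    return 0.0
```

A pair that scores 0.0 here is one the heuristics cannot resolve, which is where the deep path would take over.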

LLM Analysis (Deep Path)

Used when heuristics produce low confidence:

System: You are analyzing data flow compatibility between two XML schemas.

Given:
- Source schema (output of previous step): {from_schema}
- Target schema (input of next step): {to_schema}

Propose a field mapping. For each target field, identify:
1. The best source field to map from (if any)
2. Confidence (0-1) in the mapping
3. Brief reason for the mapping

If a required target field cannot be mapped, flag it.
If source fields will be discarded, note them.

Respond in JSON format.
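
The prompt asks for JSON but does not pin down its shape. The sketch below assumes the model replies with a top-level "mappings" list whose entries mirror the API response (to_field, optional from_field, confidence, reason), and rejects anything else before it reaches the rest of the pipeline. The function name is hypothetical.

```python
import json


def parse_llm_mapping(raw: str) -> list[dict]:
    """Parse and sanity-check the model's JSON reply (assumed shape)."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    mappings = data.get("mappings") if isinstance(data, dict) else None
    if not isinstance(mappings, list):
        raise ValueError("reply must be an object with a 'mappings' list")
    for entry in mappings:
        if not entry.get("to_field"):
            raise ValueError("every mapping needs a 'to_field'")
        confidence = entry.get("confidence")
        if not isinstance(confidence, (int, float)) or not 0 <= confidence <= 1:
            raise ValueError(f"confidence out of range for {entry['to_field']!r}")
    return mappings
```

Validating here keeps a malformed model reply from being stored as a proposed mapping.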

Caching Strategy

  • Cache key: hash(from_schema) + hash(to_schema)
  • TTL: 24 hours (schemas rarely change)
  • Invalidate on: node schema update, user-requested cache clear
  • Store: Redis or in-memory LRU
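
A minimal in-memory version of this strategy could look like the sketch below. It assumes SHA-256 as the hash and a plain dict with expiry timestamps. It is not an LRU and does not use Redis, and the class and function names are illustrative.

```python
import hashlib
import time


def cache_key(from_schema: str, to_schema: str) -> str:
    """Combine both schema hashes; order matters because edges are directed."""
    h_from = hashlib.sha256(from_schema.encode("utf-8")).hexdigest()
    h_to = hashlib.sha256(to_schema.encode("utf-8")).hexdigest()
    return f"{h_from}:{h_to}"


class TTLCache:
    def __init__(self, ttl_seconds: float = 24 * 60 * 60, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._store: dict = {}

    def set(self, key: str, value) -> None:
        self._store[key] = (self._clock() + self._ttl, value)

    def get(self, key: str):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: drop and report a miss
            return None
        return value

    def invalidate(self, key: str) -> None:
        self._store.pop(key, None)
```

Because the key depends only on schema content, two different flows that connect the same pair of schemas would share one cached analysis.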

Database Schema

Edge Mappings Table

CREATE TABLE edge_mappings (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    flow_id UUID NOT NULL REFERENCES flows(id) ON DELETE CASCADE,
    
    -- Edge identification
    from_node VARCHAR(100) NOT NULL,
    to_node VARCHAR(100) NOT NULL,
    
    -- Analysis results
    confidence NUMERIC(3,2),
    level VARCHAR(10),  -- 'high', 'medium', 'low'
    analysis_method VARCHAR(20),  -- 'heuristic', 'llm'
    
    -- The actual mapping (JSON)
    proposed_mapping JSONB,
    user_mapping JSONB,  -- User overrides, if any
    
    -- Status
    user_confirmed BOOLEAN DEFAULT FALSE,
    
    -- Timestamps
    analyzed_at TIMESTAMP WITH TIME ZONE,
    confirmed_at TIMESTAMP WITH TIME ZONE,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    
    UNIQUE(flow_id, from_node, to_node)
);

CREATE INDEX idx_edge_mappings_flow ON edge_mappings(flow_id);
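
The table stores both the proposed mapping and any user override. One plausible way to resolve the two into a single effective mapping is sketched below. It assumes a user mapping, when present, replaces the proposal entirely rather than being merged with it, and the dataclass is a hypothetical in-memory view of a row that returns plain dicts.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EdgeMapping:
    from_node: str
    to_node: str
    proposed_mapping: Optional[dict] = None  # proposed_mapping JSONB column
    user_mapping: Optional[dict] = None      # user_mapping JSONB column
    user_confirmed: bool = False

    @property
    def effective_mapping(self) -> list[dict]:
        """User overrides win; otherwise fall back to the proposal."""
        chosen = self.user_mapping if self.user_mapping is not None else self.proposed_mapping
        return list((chosen or {}).get("mappings", []))
```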

Sequencer Integration

When a sequence is executed, the sequencer factory:

  1. Loads all edge mappings for the flow
  2. For each edge, generates a transformer function:
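from typing import Callable

# parse_xml, extract, evaluate, and serialize_xml are pipeline helpers
# that are not defined in this spec.
def generate_transformer(edge_mapping: EdgeMapping) -> Callable[[str], str]:
    """
    Generate a function that transforms A's output to B's input.
    """
    def transform(source_xml: str) -> str:
        source = parse_xml(source_xml)
        target = {}

        for mapping in edge_mapping.effective_mapping:
            if mapping.from_field:
                # Copy a value from the source document.
                target[mapping.to_field] = extract(source, mapping.from_field)
            elif mapping.constant is not None:
                # Literal value; `is not None` keeps falsy constants such as 0.
                target[mapping.to_field] = mapping.constant
            elif mapping.expression:
                # Computed value, evaluated against the source document.
                target[mapping.to_field] = evaluate(mapping.expression, source)

        return serialize_xml(target, edge_mapping.to_schema)

    return transform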
  3. Calls the transformer between consecutive steps in the sequence
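
For a concrete feel of what such a transformer does, here is a self-contained sketch built on the standard library's ElementTree. It is deliberately simpler than the real thing: it assumes flat XML where each field is a direct child element, takes mappings as plain dicts, and skips expression mappings. The function name and the target_root parameter are illustrative.

```python
import xml.etree.ElementTree as ET
from typing import Callable


def make_transform(mappings: list[dict], target_root: str) -> Callable[[str], str]:
    """Build a function that rewrites flat source XML into flat target XML."""
    def transform(source_xml: str) -> str:
        source = ET.fromstring(source_xml)
        target = ET.Element(target_root)
        for m in mappings:
            if "from_field" in m:
                node = source.find(m["from_field"])  # direct child lookup
                if node is None:
                    raise KeyError(f"source field {m['from_field']!r} not found")
                value = node.text or ""
            elif "constant" in m:
                value = str(m["constant"])
            else:
                continue  # expression mappings are out of scope for this sketch
            ET.SubElement(target, m["to_field"]).text = value
        return ET.tostring(target, encoding="unicode")
    return transform


transform = make_transform(
    [{"from_field": "sum", "to_field": "value"},
     {"to_field": "factor", "constant": 5}],
    target_root="input",
)
result = transform("<output><sum>7</sum></output>")
# result == "<input><value>7</value><factor>5</factor></input>"
```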

LLM-Created Sequences (Runtime)

Agents can dynamically create sequences at runtime. The sequencer factory runs the same analysis but with stricter rules.

Runtime Policy

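| Confidence | Design-time (Canvas)          | Run-time (LLM)               |
|------------|-------------------------------|------------------------------|
| 🟢 High    | Auto-wire, run                | Auto-wire, run               |
| 🟡 Medium  | Show warning, let user decide | Block, request clarification |
| 🔴 Low     | Show error, require manual    | Block, request clarification |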

Rationale: At run time there is no user on the canvas to review a yellow edge, so letting LLMs YOLO on uncertain mappings is risky. It is better to block and ask for explicit instructions.
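
The policy table translates naturally into a lookup. The sketch below is one way to encode it; the mode strings and action names are made up for illustration.

```python
# (level, mode) -> action, mirroring the runtime policy table.
_POLICY = {
    ("high", "design_time"): "auto_wire",
    ("high", "run_time"): "auto_wire",
    ("medium", "design_time"): "warn_user",
    ("medium", "run_time"): "request_clarification",
    ("low", "design_time"): "require_manual_mapping",
    ("low", "run_time"): "request_clarification",
}


def action_for(level: str, mode: str) -> str:
    """Return the action the factory should take for an analyzed edge."""
    try:
        return _POLICY[(level, mode)]
    except KeyError:
        raise ValueError(f"unknown level/mode: {level!r}, {mode!r}") from None
```

At run time only "high" results in wiring. Everything else blocks, which is the green-only rule.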

LLM Clarification Flow

1. LLM requests sequence: A → B → C
2. Factory analyzes:
   - A → B: green ✓
   - B → C: yellow (ambiguous)
3. Factory responds with structured error:
   - Which edge failed
   - What B outputs
   - What C expects  
   - Suggested resolutions
4. LLM provides hints:
   - Explicit mappings
   - Constants for missing fields
   - Fields to drop
5. Factory re-analyzes with hints
6. If green: run. If not: back to step 3.
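
The loop in steps 2 through 6 could be driven by something like the sketch below. The analyze and ask_llm callables stand in for the factory's analysis and the round trip to the agent. The max_rounds cap is an assumption added here so the loop cannot run forever; the flow above does not specify one.

```python
from typing import Callable, Optional

Edge = tuple[str, str]  # (from_node, to_node)


def wire_with_clarification(
    edges: list[Edge],
    analyze: Callable[[Edge, Optional[dict]], str],
    ask_llm: Callable[[Edge], dict],
    max_rounds: int = 3,
) -> dict[Edge, dict]:
    """Re-analyze until every edge is green; return the hints that were needed."""
    hints: dict[Edge, dict] = {}
    rounds = 0
    while True:
        # Steps 2 and 5: analyze each edge with whatever hints exist so far.
        failed = [e for e in edges if analyze(e, hints.get(e)) != "high"]
        if not failed:
            return hints  # step 6: all green, safe to run
        if rounds == max_rounds:
            raise RuntimeError(f"edges still not green: {failed}")
        rounds += 1
        # Steps 3 and 4: report each failing edge and collect new hints.
        for edge in failed:
            hints[edge] = ask_llm(edge)
```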

Edge Hints Payload

LLMs can provide explicit wiring instructions:

<CreateSequence>
  <steps>transformer, uploader</steps>
  <initial_payload>...</initial_payload>
  
  <edge_hints>
    <edge from="transformer" to="uploader">
      <map from="result" to="payload"/>
      <constant field="destination">/uploads/output.json</constant>
      <drop field="factor"/>
      <drop field="metadata"/>
    </edge>
  </edge_hints>
</CreateSequence>

Hint Types

| Hint         | Syntax                                  | Effect                            |
|--------------|-----------------------------------------|-----------------------------------|
| Map field    | <map from="X" to="Y"/>                  | Wire source field to target field |
| Set constant | <constant field="Y">value</constant>    | Set target field to literal value |
| Drop field   | <drop field="X"/>                       | Explicitly ignore source field    |
| Expression   | <expr field="Y">concat(X.a, X.b)</expr> | Compute target from expression    |
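
To show how the factory might read these hints, the sketch below parses a CreateSequence payload with the standard library's ElementTree and groups hints by edge. The function name and the output dict layout are assumptions, and unknown hint tags are rejected rather than ignored.

```python
import xml.etree.ElementTree as ET


def parse_edge_hints(xml_text: str) -> dict[tuple[str, str], list[dict]]:
    """Return {(from_node, to_node): [hint, ...]} for a CreateSequence payload."""
    root = ET.fromstring(xml_text)
    result: dict[tuple[str, str], list[dict]] = {}
    for edge in root.findall("./edge_hints/edge"):
        hints = []
        for child in edge:
            if child.tag == "map":
                hints.append({"type": "map", "from": child.get("from"), "to": child.get("to")})
            elif child.tag == "constant":
                hints.append({"type": "constant", "field": child.get("field"),
                              "value": child.text or ""})
            elif child.tag == "drop":
                hints.append({"type": "drop", "field": child.get("field")})
            elif child.tag == "expr":
                hints.append({"type": "expr", "field": child.get("field"),
                              "expression": child.text or ""})
            else:
                raise ValueError(f"unknown hint type: <{child.tag}>")
        result[(edge.get("from"), edge.get("to"))] = hints
    return result
```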

Error Response to LLM

When wiring fails:

<SequenceError>
  <code>wiring_failed</code>
  <edge from="transformer" to="uploader"/>
  
  <source_fields>
    <field name="result" type="string"/>
    <field name="factor" type="int"/>
    <field name="metadata" type="object"/>
  </source_fields>
  
  <target_fields>
    <field name="payload" type="string" required="true" mapped="result" confidence="0.85"/>
    <field name="destination" type="string" required="true" mapped="" confidence="0"/>
  </target_fields>
  
  <issues>
    <issue type="unmapped_required">destination has no source</issue>
    <issue type="unmapped_source">factor will be dropped</issue>
    <issue type="unmapped_source">metadata will be dropped</issue>
  </issues>
  
  <suggestion>Provide mapping for 'destination' or set as constant.</suggestion>
</SequenceError>

This gives the LLM enough information to either:

  • Provide the missing hints
  • Try a different sequence
  • Ask the user for help
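
On the receiving side, an agent harness could pull out the fields that still need attention before deciding which of these options to take. The sketch below assumes the SequenceError layout shown above and treats an empty mapped attribute as "no source". The function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def unmapped_required_fields(error_xml: str) -> list[str]:
    """List required target fields that have no source in a wiring_failed error."""
    root = ET.fromstring(error_xml)
    if root.findtext("code") != "wiring_failed":
        return []
    return [
        field.get("name", "")
        for field in root.findall("./target_fields/field")
        if field.get("required") == "true" and not field.get("mapped")
    ]
```

For the example error above this returns only destination, which is exactly the field the suggestion element asks the LLM to map or set as a constant.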

Future Enhancements

v1.1 — Type Coercion

  • Automatic int → string, date formatting, etc.
  • Warnings when lossy conversion occurs

v1.2 — Expression Builder

  • Visual expression editor for complex mappings
  • Functions: concat(), format(), split(), lookup()

v1.3 — Learning from Corrections

  • Track when users override AI suggestions
  • Fine-tune confidence thresholds
  • Eventually: personalized mapping suggestions

v2.0 — Multi-Output Nodes

  • Some nodes produce multiple outputs
  • UI shows multiple output ports
  • User wires specific port to specific input

This spec is a living document. Update as implementation progresses.