xml-pipeline/docs/edge-analysis-spec.md
Donna 4aa40ed29b Add edge analysis API spec - AI-assisted field mapping
- POST /api/v1/flows/{id}/analyze-edge endpoint spec
- Confidence levels: high (green), medium (yellow), low (red)
- Heuristic + LLM analysis paths
- Database schema for edge_mappings
- Sequencer integration notes
- Future enhancements roadmap

Co-authored-by: Dan
2026-01-26 06:46:17 +00:00

Edge Analysis API Specification

Status: Draft
Author: Donna (with Dan)
Date: 2026-01-26

Overview

When users connect nodes in the visual flow editor, an AI analyzes the schema compatibility and proposes field mappings. This provides immediate visual feedback (green/yellow/red) and reduces manual configuration for common cases.

User Experience

Visual Feedback

When a user draws a connection from Node A to Node B:

┌─────────────┐                           ┌─────────────┐
│   Node A    │    🟢 ────────────────▶   │   Node B    │
│             │    High confidence        │             │
└─────────────┘                           └─────────────┘

┌─────────────┐                           ┌─────────────┐
│   Node C    │    🟡 ─ ─ ─ ─ ─ ─ ─ ─▶   │   Node D    │
│             │    Review suggested       │             │
└─────────────┘                           └─────────────┘

┌─────────────┐                           ┌─────────────┐
│   Node E    │    🔴 ─ ─ ─ ✕ ─ ─ ─ ─▶   │   Node F    │
│             │    Manual mapping needed  │             │
└─────────────┘                           └─────────────┘

Confidence Levels

| Level | Color | Threshold | Meaning |
|---|---|---|---|
| HIGH | 🟢 Green | ≥ 0.8 | All required inputs mapped, types compatible |
| MEDIUM | 🟡 Yellow | ≥ 0.4 and < 0.8 | Some mappings uncertain or missing optional fields |
| LOW | 🔴 Red | < 0.4 | Cannot determine mapping, manual intervention required |
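
The thresholds above can be sketched as a small classifier. This is only an illustration: the function name `confidence_level` and the `LEVEL_COLORS` table are hypothetical, and it assumes that scores between 0.79 and 0.8 belong to MEDIUM.

```python
# Hypothetical helper: maps a 0-1 confidence score to the level names in the table.
LEVEL_COLORS = {"high": "green", "medium": "yellow", "low": "red"}


def confidence_level(confidence: float) -> str:
    """Return 'high', 'medium' or 'low' for a confidence score in [0, 1]."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence must be in [0, 1], got {confidence}")
    if confidence >= 0.8:
        return "high"
    if confidence >= 0.4:
        # Assumption: anything from 0.4 up to (but not including) 0.8 is medium.
        return "medium"
    return "low"
```

The frontend could then pick the line color with `LEVEL_COLORS[confidence_level(score)]`.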

Interaction Flow

  1. User drags connection from A output → B input
  2. Frontend calls POST /api/v1/flows/{id}/analyze-edge
  3. Backend analyzes schemas (cached if previously computed)
  4. Frontend renders line with confidence color
  5. User clicks line → mapping panel opens
  6. User can accept, modify, or manually define mappings
  7. Changes saved to flow's canvas_state

API Endpoint

POST /api/v1/flows/{flow_id}/analyze-edge

Analyze compatibility between two nodes and propose field mappings.

Request

{
  "from_node": "calculator.add",
  "to_node": "calculator.multiply",
  "from_output_schema": "<optional: override if not using node's default>",
  "to_input_schema": "<optional: override if not using node's default>"
}

Notes:

  • If schemas are not provided, they are fetched from the node registry
  • Schemas can be XSD strings or references to registered schemas

Response

{
  "edge_id": "add_to_multiply",
  "confidence": 0.85,
  "level": "high",
  
  "proposed_mapping": {
    "mappings": [
      {
        "from_field": "output.sum",
        "to_field": "input.value",
        "confidence": 0.95,
        "reason": "Exact name match, compatible types (int → int)"
      },
      {
        "from_field": "output.operands",
        "to_field": "input.factors",
        "confidence": 0.6,
        "reason": "Semantic similarity, both arrays of numbers"
      }
    ],
    "unmapped_required": [
      {
        "field": "input.precision",
        "type": "int",
        "default": 2,
        "suggestion": "Set constant value or map from upstream"
      }
    ],
    "unmapped_optional": [
      {
        "field": "input.label",
        "type": "string"
      }
    ]
  },
  
  "warnings": [
    "input.precision has no source, using default value 2",
    "output.metadata will be discarded (no matching input field)"
  ],
  
  "errors": [],

  "analysis_method": "llm",
  "cached": false,
  "analysis_time_ms": 245
}

Notes:

  • analysis_method is "llm", or "heuristic" for simple cases
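
One way the unmapped_required and unmapped_optional lists in the response above might be derived is sketched below. The shape of `target_fields` (a dict with `type`, `required` and an optional `default` per field) is an assumption made for illustration; the spec does not define how parsed schemas are represented.

```python
from typing import Dict, List, Tuple


def split_unmapped(target_fields: Dict[str, dict],
                   mappings: List[dict]) -> Tuple[List[dict], List[dict]]:
    """Split target fields that no mapping covers into (required, optional)."""
    mapped = {m["to_field"] for m in mappings}
    required, optional = [], []
    for name, spec in target_fields.items():
        if name in mapped:
            continue
        entry = {"field": name, "type": spec["type"]}
        if spec.get("required"):
            # Carry the schema default along so the UI can offer it as a fallback.
            if "default" in spec:
                entry["default"] = spec["default"]
            required.append(entry)
        else:
            optional.append(entry)
    return required, optional
```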

Error Response

{
  "error": "schema_not_found",
  "message": "Node 'calculator.add' has no registered output schema",
  "details": {
    "node": "calculator.add"
  }
}
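
A minimal sketch of how schema resolution and the schema_not_found error could fit together. The registry layout (a dict of node name to `{"input": ..., "output": ...}`), the `SchemaNotFound` class and `resolve_schema` are all hypothetical names; the real node registry interface is not described in this spec.

```python
from typing import Dict, Optional


class SchemaNotFound(Exception):
    """Raised when neither an override nor a registered schema is available."""

    def __init__(self, node: str, direction: str):
        super().__init__(f"Node '{node}' has no registered {direction} schema")
        self.node = node

    def to_response(self) -> dict:
        # Mirrors the error response body shown above.
        return {
            "error": "schema_not_found",
            "message": str(self),
            "details": {"node": self.node},
        }


def resolve_schema(registry: Dict[str, Dict[str, str]], node: str,
                   direction: str, override: Optional[str] = None) -> str:
    """Prefer the request override; otherwise look the schema up in the registry."""
    if override is not None:
        return override
    schema = registry.get(node, {}).get(direction)
    if schema is None:
        raise SchemaNotFound(node, direction)
    return schema
```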

GET /api/v1/flows/{flow_id}/edges

List all edges in a flow with their current mapping status.

Response

{
  "edges": [
    {
      "id": "edge_1",
      "from_node": "input",
      "to_node": "calculator.add",
      "confidence": 0.92,
      "level": "high",
      "mapping_status": "auto",
      "last_analyzed": "2026-01-26T06:30:00Z"
    },
    {
      "id": "edge_2", 
      "from_node": "calculator.add",
      "to_node": "formatter",
      "confidence": 0.45,
      "level": "medium",
      "mapping_status": "user_modified",
      "last_analyzed": "2026-01-26T06:30:00Z"
    }
  ]
}

PUT /api/v1/flows/{flow_id}/edges/{edge_id}/mapping

Save user-defined or modified mapping for an edge.

Request

{
  "mappings": [
    {
      "from_field": "output.sum",
      "to_field": "input.value"
    },
    {
      "to_field": "input.factor",
      "constant": 5
    },
    {
      "to_field": "input.label",
      "expression": "concat('Result: ', output.sum)"
    }
  ],
  "user_confirmed": true
}
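
The request above shows three kinds of mapping entry: from_field, constant, and expression. A server-side check might look like the sketch below. The rule that each entry carries exactly one of those keys, and that a to_field appears only once, is an assumption inferred from the example rather than something the spec states; `validate_mappings` is a hypothetical name.

```python
from typing import List

SOURCE_KEYS = ("from_field", "constant", "expression")


def validate_mappings(mappings: List[dict]) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    seen_targets = set()
    for i, entry in enumerate(mappings):
        target = entry.get("to_field")
        if not target:
            problems.append(f"mappings[{i}]: missing to_field")
            continue
        present = [key for key in SOURCE_KEYS if key in entry]
        if len(present) != 1:
            problems.append(
                f"mappings[{i}]: expected exactly one of "
                f"{', '.join(SOURCE_KEYS)}, found {len(present)}"
            )
        if target in seen_targets:
            problems.append(f"mappings[{i}]: duplicate to_field '{target}'")
        seen_targets.add(target)
    return problems
```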

Analysis Engine

Heuristic Analysis (Fast Path)

Used when schemas are simple and mapping is obvious:

  1. Exact name match: output.value → input.value (confidence: 0.95)
  2. Case-insensitive match: output.Value → input.value (confidence: 0.9)
  3. Common aliases: output.result → input.value (confidence: 0.7)
  4. Type compatibility: int → float OK, string → int NOT OK
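
The four rules above could be combined roughly as follows. This is a sketch, not the actual engine: the alias table and the set of allowed widenings are hypothetical placeholders, and field names are compared by their leaf name (`value` rather than `output.value`).

```python
from typing import Dict, Optional, Tuple

# Hypothetical alias table: target field name -> source names accepted as aliases.
ALIASES = {"value": {"result", "output", "val"}}
# Conversions treated as safe; the spec only names int -> float explicitly.
WIDENING = {("int", "float")}


def types_compatible(src_type: str, dst_type: str) -> bool:
    return src_type == dst_type or (src_type, dst_type) in WIDENING


def match_field(target: str, target_type: str,
                source_fields: Dict[str, str]) -> Optional[Tuple[str, float]]:
    """Return (source_name, confidence) for the best heuristic match, or None."""
    best = None
    for name, src_type in source_fields.items():
        if not types_compatible(src_type, target_type):
            continue  # rule 4: an incompatible type rules the candidate out
        if name == target:
            score = 0.95  # rule 1: exact name match
        elif name.lower() == target.lower():
            score = 0.9   # rule 2: case-insensitive match
        elif name.lower() in ALIASES.get(target.lower(), set()):
            score = 0.7   # rule 3: common alias
        else:
            continue
        if best is None or score > best[1]:
            best = (name, score)
    return best
```

A target field for which `match_field` returns `None` is a natural candidate to hand over to the LLM path.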

LLM Analysis (Deep Path)

Used when heuristics produce low confidence:

System: You are analyzing data flow compatibility between two XML schemas.

Given:
- Source schema (output of previous step): {from_schema}
- Target schema (input of next step): {to_schema}

Propose a field mapping. For each target field, identify:
1. The best source field to map from (if any)
2. Confidence (0-1) in the mapping
3. Brief reason for the mapping

If a required target field cannot be mapped, flag it.
If source fields will be discarded, note them.

Respond in JSON format.
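
A sketch of how the prompt above might be filled in and the model's reply handled. No real LLM client is called here. The reply shape (a top-level `mappings` list whose items carry `confidence`) is an assumption, since the spec only says "Respond in JSON format", and `build_prompt` and `parse_reply` are hypothetical names.

```python
import json
from typing import List

PROMPT_TEMPLATE = """You are analyzing data flow compatibility between two XML schemas.

Given:
- Source schema (output of previous step): {from_schema}
- Target schema (input of next step): {to_schema}

Propose a field mapping. For each target field, identify:
1. The best source field to map from (if any)
2. Confidence (0-1) in the mapping
3. Brief reason for the mapping

If a required target field cannot be mapped, flag it.
If source fields will be discarded, note them.

Respond in JSON format."""


def build_prompt(from_schema: str, to_schema: str) -> str:
    return PROMPT_TEMPLATE.format(from_schema=from_schema, to_schema=to_schema)


def parse_reply(reply_text: str) -> List[dict]:
    """Parse the model's JSON reply and clamp each confidence into [0, 1]."""
    data = json.loads(reply_text)  # raises json.JSONDecodeError on malformed output
    cleaned = []
    for item in data.get("mappings", []):
        confidence = float(item.get("confidence", 0.0))
        cleaned.append({**item, "confidence": min(1.0, max(0.0, confidence))})
    return cleaned
```

Clamping and strict JSON parsing are there because model output is not guaranteed to respect the requested format; a parse failure could reasonably be surfaced as a low-confidence result.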

Caching Strategy

  • Cache key: hash(from_schema) + hash(to_schema)
  • TTL: 24 hours (schemas rarely change)
  • Invalidate on: node schema update, or when a user clears the cache
  • Store: Redis or in-memory LRU
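
The cache key and TTL above can be sketched with the standard library. SHA-256 is an assumed choice of hash, and the `:` separator keeps the two digests distinct so that swapping source and target yields a different key. `TTLCache` is a hypothetical in-memory stand-in: it expires entries but does not implement LRU eviction.

```python
import hashlib
import time


def cache_key(from_schema: str, to_schema: str) -> str:
    """Build the cache key from the two schema hashes, in source-then-target order."""
    def digest(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"{digest(from_schema)}:{digest(to_schema)}"


class TTLCache:
    """Minimal in-memory cache with per-entry expiry (default 24 hours)."""

    def __init__(self, ttl_seconds: float = 24 * 3600, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = {}

    def get(self, key: str):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # drop the expired entry lazily
            return None
        return value

    def set(self, key: str, value) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)

    def invalidate(self, key: str) -> None:
        self._entries.pop(key, None)
```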

Database Schema

Edge Mappings Table

CREATE TABLE edge_mappings (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    flow_id UUID NOT NULL REFERENCES flows(id) ON DELETE CASCADE,
    
    -- Edge identification
    from_node VARCHAR(100) NOT NULL,
    to_node VARCHAR(100) NOT NULL,
    
    -- Analysis results
    confidence NUMERIC(3,2),
    level VARCHAR(10),  -- 'high', 'medium', 'low'
    analysis_method VARCHAR(20),  -- 'heuristic', 'llm'
    
    -- The actual mapping (JSON)
    proposed_mapping JSONB,
    user_mapping JSONB,  -- User overrides, if any
    
    -- Status
    user_confirmed BOOLEAN DEFAULT FALSE,
    
    -- Timestamps
    analyzed_at TIMESTAMP WITH TIME ZONE,
    confirmed_at TIMESTAMP WITH TIME ZONE,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    
    UNIQUE(flow_id, from_node, to_node)
);

CREATE INDEX idx_edge_mappings_flow ON edge_mappings(flow_id);
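
For illustration, a row of this table might be represented in application code as below. The `EdgeMapping` dataclass mirrors the main columns (timestamps omitted). The `effective_mapping` property assumes that a non-null user_mapping replaces proposed_mapping entirely rather than being merged with it; the spec does not say which.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EdgeMapping:
    flow_id: str
    from_node: str
    to_node: str
    confidence: Optional[float] = None
    level: Optional[str] = None            # 'high', 'medium', 'low'
    analysis_method: Optional[str] = None  # 'heuristic', 'llm'
    proposed_mapping: Optional[dict] = None
    user_mapping: Optional[dict] = None
    user_confirmed: bool = False

    @property
    def effective_mapping(self) -> List[dict]:
        """Mapping entries to apply: user overrides win over the AI proposal."""
        if self.user_mapping is not None:
            chosen = self.user_mapping
        else:
            chosen = self.proposed_mapping or {}
        return list(chosen.get("mappings", []))
```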

Sequencer Integration

When a sequence is executed, the sequencer factory:

  1. Loads all edge mappings for the flow
  2. For each edge, generates a transformer function:
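from typing import Callable

# parse_xml, extract, evaluate and serialize_xml are helpers assumed to be
# defined elsewhere in the pipeline; EdgeMapping is the loaded edge_mappings row.
def generate_transformer(edge_mapping: EdgeMapping) -> Callable[[str], str]:
    """
    Generate a function that transforms A's output to B's input.
    """
    def transform(source_xml: str) -> str:
        source = parse_xml(source_xml)
        target = {}

        # Fill each target field from a source field, a constant, or an expression
        for mapping in edge_mapping.effective_mapping:
            if mapping.from_field:
                # Copy a value from the upstream node's output
                target[mapping.to_field] = extract(source, mapping.from_field)
            elif mapping.constant is not None:
                # Fixed value supplied in the mapping
                target[mapping.to_field] = mapping.constant
            elif mapping.expression:
                # Computed value, e.g. concat('Result: ', output.sum)
                target[mapping.to_field] = evaluate(mapping.expression, source)

        return serialize_xml(target, edge_mapping.to_schema)

    return transform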
  3. The transformer is called between each step in the sequence
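
A self-contained variant of the transformer idea, runnable with only the standard library, is sketched below. It makes several assumptions that the spec does not confirm: dotted field paths such as `output.sum` are read as root element `output` with child `sum`, values are copied as element text, mappings are plain dicts, and expression mappings are skipped. `make_transformer` and `extract_text` are hypothetical names.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List, Optional


def extract_text(root: ET.Element, dotted_path: str) -> Optional[str]:
    """Resolve 'root.child.leaf' against the parsed document; None if absent."""
    parts = dotted_path.split(".")
    if parts[0] != root.tag:
        return None
    node = root
    for part in parts[1:]:
        node = node.find(part)
        if node is None:
            return None
    return node.text


def make_transformer(mappings: List[dict]) -> Callable[[str], str]:
    def transform(source_xml: str) -> str:
        source = ET.fromstring(source_xml)
        target_root = None
        for mapping in mappings:
            if "from_field" in mapping:
                value = extract_text(source, mapping["from_field"])
            elif "constant" in mapping:
                value = mapping["constant"]
            else:
                continue  # expression mappings are out of scope for this sketch
            if value is None:
                continue
            parts = mapping["to_field"].split(".")
            if target_root is None:
                # The first path segment of the first mapping names the root element.
                target_root = ET.Element(parts[0])
            node = target_root
            for part in parts[1:]:
                child = node.find(part)
                if child is None:
                    child = ET.SubElement(node, part)
                node = child
            node.text = str(value)
        if target_root is None:
            raise ValueError("mapping produced no target fields")
        return ET.tostring(target_root, encoding="unicode")

    return transform
```

For example, with the mappings `output.sum → input.value` and a constant 5 for `input.factor`, the source `<output><sum>7</sum></output>` becomes `<input><value>7</value><factor>5</factor></input>`.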

Future Enhancements

v1.1 — Type Coercion

  • Automatic int → string, date formatting, etc.
  • Warnings when lossy conversion occurs

v1.2 — Expression Builder

  • Visual expression editor for complex mappings
  • Functions: concat(), format(), split(), lookup()

v1.3 — Learning from Corrections

  • Track when users override AI suggestions
  • Fine-tune confidence thresholds
  • Eventually: personalized mapping suggestions

v2.0 — Multi-Output Nodes

  • Some nodes produce multiple outputs
  • UI shows multiple output ports
  • User wires specific port to specific input

This spec is a living document. Update as implementation progresses.