dullfig 515c738abb Add wiki documentation for xml-pipeline.org

Comprehensive documentation set for XWiki:
- Home, Installation, Quick Start guides
- Writing Handlers and LLM Router guides
- Architecture docs (Overview, Message Pump, Thread Registry, Shared Backend)
- Reference docs (Configuration, Handler Contract, CLI)
- Hello World tutorial
- Why XML rationale
- Pandoc conversion scripts (bash + PowerShell)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-01-20 20:40:47 -08:00


Why XML?

XML is the right format for a sovereign, attack-resistant message bus in a multi-agent system. JSON is not.

The Short Answer

Feature	XML	JSON
Schema validation	XSD (W3C standard, precise)	JSON Schema (optional, lossy)
Namespaces	Native support	None
Canonicalization	C14N standard	No widely adopted standard
Repair tolerance	Parser recovery (e.g. lxml recover mode)	Strict parsers reject the payload
Comments	Supported	Forbidden
Mixed content	Native	Fragile

JSON's Origins

JSON (JavaScript Object Notation) was specified in the early 2000s as a subset of JavaScript literal syntax for simple data exchange in web browsers. Its original purpose was narrow: a quick way to serialize objects for Ajax calls, not a general-purpose message format.

It became popular because:

Simple for JavaScript developers
Human-readable
Web API boom (REST over SOAP)
Low barrier to entry

Why JSON Fails for Multi-Agent Systems

No Schema Enforcement

JSON Schema exists but is:

Optional (rarely enforced on wire)
Lossy (can't express all constraints)
Inconsistently implemented

Result: Messages accepted without validation, bugs discovered at runtime.

No Namespaces

Can't safely mix vocabularies:

{
  "name": "Alice",      // User name? Product name?
  "type": "admin"       // User type? Message type?
}

No Canonicalization

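No widely adopted way to normalize for signing (RFC 8785, the JSON Canonicalization Scheme, is an informational RFC, not something typical JSON libraries apply by default):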

{"a": 1, "b": 2}
{"b": 2, "a": 1}

Same data, different bytes. A signature computed over one serialization does not verify against the other.
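The point above can be checked with a minimal sketch using only Python's standard `json` and `hashlib` modules. The SHA-256 digest stands in for whatever signing scheme a real bus would use; that substitution is an assumption made for illustration.

```python
import hashlib
import json

# Two serializations of the same logical object, differing only in key order.
first = json.dumps({"a": 1, "b": 2})
second = json.dumps({"b": 2, "a": 1})

# Parsed back, the two values compare equal ...
same_data = json.loads(first) == json.loads(second)

# ... but the bytes on the wire differ, so any digest over them differs too.
same_bytes = first.encode() == second.encode()
digest_first = hashlib.sha256(first.encode()).hexdigest()
digest_second = hashlib.sha256(second.encode()).hexdigest()

print(same_data, same_bytes, digest_first == digest_second)
```

Producers and verifiers can agree on a private convention (sorted keys, fixed separators), but that convention has to be enforced by every participant.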

No Repair Tolerance

One syntax error → entire payload rejected:

{"name": "Alice",}     // Trailing comma → FAIL
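A short sketch of that failure mode, using Python's standard `json` module as one example of a strict parser (other parsers may report the error differently):

```python
import json

payload = '{"name": "Alice",}'  # trailing comma, as in the example above

try:
    json.loads(payload)
    parsed_ok = True
except json.JSONDecodeError as exc:
    # The whole payload is rejected; the valid "name" field is not recoverable.
    parsed_ok = False
    error_message = str(exc)

print(parsed_ok)
```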

Escaping Hell

Strings with special characters are fragile:

{"message": "She said \"hello\""}   // Manual escaping

Hand-written escaping is easy to get wrong, and a missed escape is a security risk.

Why JSON Fails for LLM Integration

Hallucination Fragility

LLMs routinely produce invalid JSON:

Trailing commas
Missing quotes
Wrong nesting
Comments (forbidden!)

Result: Massive prompt bloat ("You MUST output valid JSON, NO trailing commas EVER...") and post-processing parsers.

No Graceful Degradation

One parse error → entire response lost. No partial recovery.

Injection Attacks

User input spliced into a JSON string can break its structure:

{"user_input": "Alice", "role": "admin"}

If the payload is built by string concatenation and the user supplies Alice", "role": "admin as their name → injection.
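A minimal sketch of the injection, assuming the payload is assembled by string concatenation (as hand-built prompts and templates often are). The variable names are illustrative only.

```python
import json

# Attacker-controlled "name" containing a quote and an extra key.
malicious_name = 'Alice", "role": "admin'

# Unsafe: splicing user input into a JSON template as raw text.
unsafe = '{"user_input": "' + malicious_name + '"}'

# Safer: letting the serializer escape the embedded quotes.
safe = json.dumps({"user_input": malicious_name})

unsafe_parsed = json.loads(unsafe)  # gains an attacker-supplied "role" key
safe_parsed = json.loads(safe)      # keeps the whole input as one string

print(unsafe_parsed)
print(safe_parsed)
```

The same discipline applies to any text format: structure should come from a serializer, not from concatenated strings.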

Why XML Succeeds

Schema as Contract

XSD enforces exact structure on the wire:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="greeting">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="name" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

Every message validated before processing. No ambiguity.
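As a rough illustration of validating before processing, the sketch below wraps the `greeting` definition in an `xs:schema` root and checks two messages with lxml's `etree.XMLSchema`. The `check` helper is a hypothetical name, not part of xml-pipeline.

```python
from lxml import etree

XSD = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="greeting">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="name" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

schema = etree.XMLSchema(etree.fromstring(XSD))


def check(xml_bytes):
    """Return (is_valid, error messages) for one message (hypothetical helper)."""
    doc = etree.fromstring(xml_bytes)
    ok = schema.validate(doc)
    return ok, [entry.message for entry in schema.error_log]


valid = check(b"<greeting><name>Alice</name></greeting>")
invalid = check(b"<greeting/>")  # required <name> child is missing

print(valid)
print(invalid)
```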

Namespaces

Safe vocabulary mixing:

<message xmlns="https://xml-pipeline.org/ns/envelope/v1">
  <user:profile xmlns:user="https://example.org/user">
    <user:name>Alice</user:name>
  </user:profile>
</message>
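To show how a consumer tells the vocabularies apart, here is a small sketch that parses the message above with the standard library's `xml.etree.ElementTree`. The `env` and `user` prefixes in the lookup map are local choices; only the namespace URIs have to match the document.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<message xmlns="https://xml-pipeline.org/ns/envelope/v1">'
    '<user:profile xmlns:user="https://example.org/user">'
    '<user:name>Alice</user:name>'
    '</user:profile>'
    '</message>'
)

# Prefixes here are local to this lookup; the URIs identify the vocabulary.
ns = {
    "env": "https://xml-pipeline.org/ns/envelope/v1",
    "user": "https://example.org/user",
}

root_tag = doc.tag  # fully qualified: "{namespace-uri}message"
user_name = doc.findtext("user:profile/user:name", namespaces=ns)

# An unqualified lookup does not match the namespaced element.
bare_name = doc.find(".//name")

print(root_tag, user_name, bare_name)
```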

Canonicalization (C14N)

Deterministic representation for signing:

from lxml import etree
# tree: a parsed lxml document; sign(): placeholder for the signing routine
c14n_bytes = etree.tostring(tree, method='c14n')
signature = sign(c14n_bytes)

Same logical content → same bytes → verifiable signatures.
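The sketch below illustrates the same idea end to end. To stay dependency-free it uses the standard library's `xml.etree.ElementTree.canonicalize` (C14N 2.0) rather than lxml, and an HMAC-SHA256 with a placeholder key as the `sign` function. Both choices are assumptions for the example, not the project's actual signing scheme.

```python
import hashlib
import hmac
from xml.etree.ElementTree import canonicalize

KEY = b"demo-key"  # placeholder key, for illustration only


def sign(xml_text: str) -> str:
    """Canonicalize the document, then MAC the canonical bytes."""
    canonical = canonicalize(xml_text)  # returns the C14N 2.0 form as str
    return hmac.new(KEY, canonical.encode("utf-8"), hashlib.sha256).hexdigest()


# Same logical content: attribute order, quote style and tag spacing differ.
a = '<greeting lang="en" id="1"><name>Alice</name></greeting>'
b = "<greeting id='1'   lang='en'><name>Alice</name></greeting >"

print(a == b)              # the raw text differs
print(sign(a) == sign(b))  # the signatures over the canonical form match
```

Canonicalization does not normalize whitespace inside element content by default, so producers still need to agree on how text nodes are written.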

Repair Tolerance

lxml recover mode fixes common issues:

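from lxml import etree
parser = etree.XMLParser(recover=True)
tree = etree.fromstring(broken_xml, parser)

Partial documents, encoding issues, and missing end tags → recovered on a best-effort basis. Content the parser cannot repair may be dropped, so check parser.error_log after parsing.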
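A small sketch of what recovery looks like in practice, assuming lxml is installed. The broken input is made up for the example; how a given document is repaired depends on libxml2, so the result should be inspected rather than trusted blindly.

```python
from lxml import etree

broken_xml = b"<greeting><name>Alice</greeting>"  # <name> is never closed

parser = etree.XMLParser(recover=True)
root = etree.fromstring(broken_xml, parser)

# With recover=True the result can be None when nothing usable was found.
if root is None:
    raise ValueError("nothing recoverable in input")

repaired = etree.tostring(root)
# The parser still records what it had to fix.
problems = [entry.message for entry in parser.error_log]

print(repaired)
print(problems)
```

Running schema validation on the recovered tree is a sensible next step, since a repaired document is not necessarily a valid one.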

Self-Describing

Elements carry meaning:

<greeting>
  <name>Alice</name>
</greeting>

vs JSON:

["Alice"]  // What is this?

LLM + XML = Reliable

Natural Streaming

XML streams naturally: an incremental parser can hand over completed elements before the rest of the document has arrived.
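One way to sketch incremental processing is the standard library's `XMLPullParser`, fed chunk by chunk. The chunks below are hypothetical and stand in for data arriving from a socket or a token stream; how soon an event becomes available after a `feed` can vary with the underlying Expat version.

```python
import xml.etree.ElementTree as ET

# Hypothetical chunks; the first one ends in the middle of a tag.
chunks = [
    "<batch><greeting><na",
    "me>Alice</name></greeting>",
    "<greeting><name>Bob</name>",
    "</greeting></batch>",
]

parser = ET.XMLPullParser(events=("end",))
names = []


def drain():
    """Collect the name from every <greeting> that has been fully parsed so far."""
    for _event, elem in parser.read_events():
        if elem.tag == "greeting":
            names.append(elem.findtext("name"))


for chunk in chunks:
    parser.feed(chunk)
    drain()  # completed greetings can be handled here, before the stream ends

parser.close()
drain()  # pick up anything the parser reported only at close

print(names)
```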

Repair on Output

LLM produces broken XML? lxml fixes it:

from lxml import etree

parser = etree.XMLParser(recover=True)
tree = etree.fromstring(llm_output, parser)
# Works even with minor errors

Schema-Guided Generation

A schema, or a template derived from it, tells the LLM exactly what to produce:

Generate XML matching this schema:
<greeting><name>string</name></greeting>

A clear contract leaves less room for malformed output.

Graceful Validation

Validation errors become helpful feedback:

<huh>
  <error>Element 'greeting' missing required element 'name'</error>
</huh>

LLM can self-correct.
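A minimal sketch of turning validation errors into that feedback message, using the standard library's `xml.etree.ElementTree`. The `build_huh` helper and its list-of-strings input are assumptions for illustration; the real pipeline may assemble `<huh>` differently.

```python
import xml.etree.ElementTree as ET


def build_huh(errors):
    """Wrap error messages in a <huh> element (hypothetical helper)."""
    huh = ET.Element("huh")
    for message in errors:
        # Setting .text lets the serializer escape any markup in the message.
        ET.SubElement(huh, "error").text = message
    return ET.tostring(huh, encoding="unicode")


feedback = build_huh(["Element 'greeting' missing required element 'name'"])
escaped = build_huh(["expected <name> inside <greeting>"])

print(feedback)
print(escaped)
```

Because the message text is escaped on serialization, an error that quotes tag names cannot corrupt the feedback document itself.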

The Trade-Offs

XML is More Verbose

<greeting><name>Alice</name></greeting>

{"name": "Alice"}

But: compression removes most of this overhead on the wire, because repeated tag names compress well. And verbosity aids debugging.
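A rough way to see the effect is to gzip a batch of messages in both formats. The data below is synthetic (1000 made-up user names), so this is an illustration of the trend rather than a benchmark; single small messages compress far less well than batches or long-lived streams.

```python
import gzip
import json

# Synthetic batch of messages; real traffic will behave differently.
names = [f"User{i}" for i in range(1000)]
xml_raw = "\n".join(
    f"<greeting><name>{n}</name></greeting>" for n in names
).encode()
json_raw = "\n".join(json.dumps({"name": n}) for n in names).encode()

xml_gz = gzip.compress(xml_raw)
json_gz = gzip.compress(json_raw)

# Size of XML relative to JSON, before and after compression.
raw_ratio = len(xml_raw) / len(json_raw)
gz_ratio = len(xml_gz) / len(json_gz)

print(f"raw: {raw_ratio:.2f}x  gzip: {gz_ratio:.2f}x")
```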

XML Parsing is Slower

For small messages, the extra cost over JSON parsing is typically on the order of microseconds.

But: Network latency dominates. And lxml is highly optimized.

XML is "Old"

True. Also mature, battle-tested, standards-based.

Conclusion

JSON won the web because it was "good enough" for stateless HTTP requests.

XML wins for multi-agent systems because:

Security requires schema enforcement
Signing requires canonicalization
LLMs require repair tolerance
Complexity requires namespaces

JSON won the web. XML wins the swarm.
