xml-pipeline/docs/wiki/Why-XML.md
dullfig 515c738abb Add wiki documentation for xml-pipeline.org
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-20 20:40:47 -08:00

Why XML?

XML is the right format for a sovereign, attack-resistant message bus in a multi-agent system. JSON is not.

The Short Answer

| Feature | XML | JSON |
|---|---|---|
| Schema validation | XSD (W3C standard, precise) | JSON Schema (optional, lossy) |
| Namespaces | Native support | None |
| Canonicalization | C14N (W3C standard) | RFC 8785 (JCS) exists, rarely deployed |
| Repair tolerance | lxml recover mode | Strict parsers reject the payload |
| Comments | Supported | Forbidden |
| Mixed content | Native | Fragile |

JSON's Origins

JSON (JavaScript Object Notation) was specified in the early 2000s as a subset of JavaScript literal syntax for simple data exchange in web browsers. It was designed as a quick way to serialize objects for Ajax calls, not as a general-purpose format.

It became popular because:

  • Simple for JavaScript developers
  • Human-readable
  • Web API boom (REST over SOAP)
  • Low barrier to entry

Why JSON Fails for Multi-Agent Systems

No Schema Enforcement

JSON Schema exists but is:

  • Optional (rarely enforced on wire)
  • Lossy (can't express all constraints)
  • Inconsistently implemented

Result: Messages accepted without validation, bugs discovered at runtime.

No Namespaces

Can't safely mix vocabularies:

{
  "name": "Alice",      // User name? Product name?
  "type": "admin"       // User type? Message type?
}

No Canonicalization

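No canonical form is built into the format, and the one published scheme (RFC 8785, JCS) is not widely implemented, so there is no dependable way to normalize for signing: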

{"a": 1, "b": 2}
{"b": 2, "a": 1}

Same data? Different bytes. Can't sign reliably.
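
A minimal sketch of the problem, using only the Python standard library. The `normalize` helper is a hypothetical, ad hoc convention (sorted keys, compact separators) that both signer and verifier would have to agree on out of band; it is not a standardized canonical form.

```python
import hashlib
import json

# Two serializations of the same logical object, differing only in key order.
doc_a = '{"a": 1, "b": 2}'
doc_b = '{"b": 2, "a": 1}'

# Parsed, the two documents compare equal...
same_data = json.loads(doc_a) == json.loads(doc_b)

# ...but a signature covers bytes, and the byte-level digests differ.
digest_a = hashlib.sha256(doc_a.encode("utf-8")).hexdigest()
digest_b = hashlib.sha256(doc_b.encode("utf-8")).hexdigest()


def normalize(text: str) -> bytes:
    """Hypothetical ad hoc normalization: sorted keys, no optional whitespace."""
    obj = json.loads(text)
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


print(same_data)                              # True
print(digest_a == digest_b)                   # False
print(normalize(doc_a) == normalize(doc_b))   # True
```

The workaround only holds if every producer and consumer applies exactly the same rules, including number and string formatting.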

No Repair Tolerance

One syntax error → entire payload rejected:

{"name": "Alice",}     // Trailing comma → FAIL
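
As a quick illustration with Python's standard `json` module (other strict parsers behave similarly), the trailing comma raises an exception and nothing of the payload is returned:

```python
import json

payload = '{"name": "Alice",}'  # trailing comma, as in the example above

try:
    json.loads(payload)
    parsed_ok = True
except json.JSONDecodeError as exc:
    parsed_ok = False
    # The exception reports where parsing stopped; no partial object is available.
    error_position = exc.pos

print(parsed_ok)
```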

Escaping Hell

Strings with special characters are fragile:

{"message": "She said \"hello\""}   // Manual escaping

Hand-built strings are easy to break, and a broken escape is a security vulnerability vector.

Why JSON Fails for LLM Integration

Hallucination Fragility

LLMs routinely produce invalid JSON:

  • Trailing commas
  • Missing quotes
  • Wrong nesting
  • Comments (forbidden!)

Result: Massive prompt bloat ("You MUST output valid JSON, NO trailing commas EVER...") and post-processing parsers.

No Graceful Degradation

One parse error → entire response lost. No partial recovery.

Injection Attacks

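User input can break JSON structure when a payload is assembled by string concatenation instead of a serializer:

{"user_input": "Alice", "role": "admin"}

If the user supplies Alice", "role": "admin as their name (note the embedded quotes), the payload above is the result → injection.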
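
A minimal sketch of how this happens. The `build_unsafe` and `build_safe` helpers are hypothetical names for illustration; the first mimics string-template assembly, the second uses the standard serializer.

```python
import json


def build_unsafe(name: str) -> str:
    # Hypothetical anti-pattern: JSON assembled by string formatting.
    return '{"user_input": "%s"}' % name


def build_safe(name: str) -> str:
    # The serializer escapes embedded quotes, so the input stays inside its string.
    return json.dumps({"user_input": name})


malicious = 'Alice", "role": "admin'

unsafe = json.loads(build_unsafe(malicious))
safe = json.loads(build_safe(malicious))

print(unsafe)  # {'user_input': 'Alice', 'role': 'admin'}
print(safe)    # {'user_input': 'Alice", "role": "admin'}
```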

Why XML Succeeds

Schema as Contract

XSD defines the exact structure, and a validating parser enforces it on the wire:

<xs:element name="greeting">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="name" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

Every message validated before processing. No ambiguity.
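
A minimal sketch of validation with lxml, assuming the element declaration above is wrapped in an `xs:schema` root. The sample documents are made up for illustration, and how xml-pipeline wires validation into its message pump is not shown here.

```python
from lxml import etree

XSD = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="greeting">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="name" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

schema = etree.XMLSchema(etree.fromstring(XSD))

good = etree.fromstring(b"<greeting><name>Alice</name></greeting>")
bad = etree.fromstring(b"<greeting/>")  # required <name> is missing

good_ok = schema.validate(good)
bad_ok = schema.validate(bad)

# error_log holds the messages from the most recent validation run.
errors = [entry.message for entry in schema.error_log]

print(good_ok, bad_ok)
for message in errors:
    print(message)
```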

Namespaces

Safe vocabulary mixing:

<message xmlns="https://xml-pipeline.org/ns/envelope/v1">
  <user:profile xmlns:user="https://example.org/user">
    <user:name>Alice</user:name>
  </user:profile>
</message>
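
A short sketch of how a consumer tells the vocabularies apart, using the standard library's ElementTree. The prefixes in the `ns` mapping are local to this code and need not match the prefixes used in the document; only the namespace URIs matter.

```python
import xml.etree.ElementTree as ET

doc = """<message xmlns="https://xml-pipeline.org/ns/envelope/v1">
  <user:profile xmlns:user="https://example.org/user">
    <user:name>Alice</user:name>
  </user:profile>
</message>"""

root = ET.fromstring(doc)
ns = {
    "env": "https://xml-pipeline.org/ns/envelope/v1",
    "user": "https://example.org/user",
}

# Lookups are qualified by namespace URI, so "name" is unambiguous.
user_name = root.find("user:profile/user:name", ns)
env_name = root.find(".//env:name", ns)  # no <name> exists in the envelope namespace

print(root.tag)        # {https://xml-pipeline.org/ns/envelope/v1}message
print(user_name.text)  # Alice
print(env_name)        # None
```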

Canonicalization (C14N)

Deterministic representation for signing:

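from lxml import etree
# tree: a parsed lxml element or tree; sign(): placeholder for your signing function
c14n_bytes = etree.tostring(tree, method='c14n')
signature = sign(c14n_bytes)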

Same logical content → same bytes → verifiable signatures.
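
To see the effect without a signing key, here is a sketch using the standard library's `xml.etree.ElementTree.canonicalize` (Python 3.8+), which implements C14N 2.0 rather than the C14N 1.0 produced by lxml's `method='c14n'`. The sample documents are invented for illustration and differ only in attribute order, quoting, and whitespace inside tags.

```python
import hashlib
import xml.etree.ElementTree as ET

doc_a = '<greeting lang="en" id="1"><name>Alice</name></greeting>'
doc_b = "<greeting  id='1'   lang='en' ><name>Alice</name></greeting >"

canon_a = ET.canonicalize(doc_a)
canon_b = ET.canonicalize(doc_b)

digest_a = hashlib.sha256(canon_a.encode("utf-8")).hexdigest()
digest_b = hashlib.sha256(canon_b.encode("utf-8")).hexdigest()

print(doc_a == doc_b)        # False: different bytes on input
print(canon_a)               # <greeting id="1" lang="en"><name>Alice</name></greeting>
print(digest_a == digest_b)  # True: identical canonical form
```

Whitespace inside element content is treated as significant by default, so canonicalization does not paper over differences in indentation between elements.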

Repair Tolerance

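lxml's recover mode repairs many common problems on a best-effort basis:

from lxml import etree
parser = etree.XMLParser(recover=True)
tree = etree.fromstring(broken_xml, parser)

Partial documents, some encoding issues, and missing closing tags are usually recovered. Recovery is best-effort, so check the result before trusting it.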

Self-Describing

Elements carry meaning:

<greeting>
  <name>Alice</name>
</greeting>

vs JSON:

["Alice"]  // What is this?

LLM + XML = Reliable

Natural Streaming

XML streams naturally: an incremental parser can hand over each element as soon as its closing tag arrives, before the document is complete.
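
A minimal sketch with the standard library's `XMLPullParser`. The chunk boundaries are invented to simulate data arriving from a socket or an LLM token stream. Exactly how early an event surfaces can depend on the buffering of the underlying expat version, so the final drain after `close()` picks up anything held back.

```python
import xml.etree.ElementTree as ET

# Simulated chunks; in practice these would arrive over the network.
chunks = ["<greeting>", "<name>Alice</name>", "</greeting>"]

parser = ET.XMLPullParser(events=("end",))
seen = []  # (step at which the element was reported, tag, text)


def drain(step: int) -> None:
    # Collect every element whose closing tag has been parsed so far.
    for _event, elem in parser.read_events():
        seen.append((step, elem.tag, elem.text))


for step, chunk in enumerate(chunks):
    parser.feed(chunk)
    drain(step)

parser.close()
drain(len(chunks))  # pick up anything the parser held back until close

for entry in seen:
    print(entry)
```

The `name` element is reported before the enclosing `greeting` closes, which is what lets a consumer act on partial output.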

Repair on Output

LLM produces broken XML? lxml fixes it:

from lxml import etree

parser = etree.XMLParser(recover=True)
tree = etree.fromstring(llm_output, parser)
# Works even with minor errors

Schema-Guided Generation

A schema, or a template derived from it, tells the LLM exactly what to produce:

Generate XML matching this schema:
<greeting><name>string</name></greeting>

Clear contract, fewer hallucinations.

Graceful Validation

Validation errors become helpful feedback:

<huh>
  <error>Element 'greeting' missing required element 'name'</error>
</huh>

LLM can self-correct.
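
One way such a reply could be assembled, as a sketch only: `build_huh` is a hypothetical helper, and the source does not show how xml-pipeline actually constructs its `<huh>` messages. The error strings would come from whatever validator is in use.

```python
import xml.etree.ElementTree as ET


def build_huh(errors: list[str]) -> str:
    """Wrap validation error messages in a <huh> element (hypothetical helper)."""
    huh = ET.Element("huh")
    for message in errors:
        # Assigning .text lets the serializer escape <, > and & in the message.
        ET.SubElement(huh, "error").text = message
    return ET.tostring(huh, encoding="unicode")


reply = build_huh(["Element 'greeting' missing required element 'name'"])
print(reply)
# <huh><error>Element 'greeting' missing required element 'name'</error></huh>
```

The reply can then be sent back to the model as the next turn, so it can retry with the specific error in view.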

The Trade-Offs

XML is More Verbose

<greeting><name>Alice</name></greeting>

vs

{"name": "Alice"}

But: compression removes most of this overhead on the wire, because repeated tag names compress well. And verbosity aids debugging.
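
A rough way to check this on your own traffic, sketched with the standard library's `zlib`. The 200 synthetic messages are an assumption chosen for illustration; real ratios depend on payload shape, and a single small message compresses far less well than a batch or a long-lived compressed stream.

```python
import zlib

names = [f"user{i}" for i in range(200)]

# The same 200 records serialized both ways.
xml_raw = "".join(f"<greeting><name>{n}</name></greeting>" for n in names).encode("utf-8")
json_raw = "".join(f'{{"name": "{n}"}}' for n in names).encode("utf-8")

xml_z = zlib.compress(xml_raw)
json_z = zlib.compress(json_raw)

print("raw bytes:       ", len(xml_raw), len(json_raw))
print("compressed bytes:", len(xml_z), len(json_z))
```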

XML Parsing is Slower

For small messages, typically microseconds more than JSON parsing.

But: Network latency dominates. And lxml is highly optimized.

XML is "Old"

True. Also mature, battle-tested, standards-based.

Conclusion

JSON won the web because it was "good enough" for stateless HTTP requests.

XML wins for multi-agent systems because:

  • Security requires schema enforcement
  • Signing requires canonicalization
  • LLMs require repair tolerance
  • Complexity requires namespaces

JSON won the web. XML wins the swarm.

Further Reading