xml-pipeline/docs/why-not-json.md
2026-01-03 17:57:07 -08:00

66 lines
No EOL
4.6 KiB
Markdown

# So Why XML and Not JSON?
XML is the right format for a sovereign, attack-resistant message bus in a multi-agent organism. JSON is not — it's a lightweight data interchange hack that exploded in popularity but carries fundamental flaws that make it brittle, insecure, and unsuitable for precise, evolvable contracts.
This project uses Exclusive C14N XML on the wire, XSD for exact validation, and repair for tolerance. The result: no drift, no escaping nightmares, no hidden parsing ambiguities. Contracts are enforced by standards, not convention.
## Where Did JSON Come From?
JSON (JavaScript Object Notation) was invented in the early 2000s by Douglas Crockford as a subset of JavaScript literal syntax for simple data exchange in web browsers. It was never designed as a general-purpose format — just a quick way to serialize objects for Ajax calls without eval() dangers.
It started as "fat-free XML" for dynamic web pages, stripping tags for brevity.
## What Made It Popular?
- **Simplicity for JS devs**: No schema, human-readable, easy to generate/parse in browsers.
- **Web API boom**: REST APIs adopted it over XML (SOAP was verbose/bloated).
- **Ecosystem inertia**: Libraries everywhere, low barrier for startups.
- **Perceived lightness**: Smaller payloads than verbose XML.
It spread because it was "good enough" for stateless HTTP requests in the 2010s web era.
## How It Carried Over to AI
LLM tool calling adopted JSON because:
- OpenAI's function calling API used JSON schemas.
- Everyone copied the leader — brittle but "standard".
- Prompt engineers learned to coerce models into valid JSON with endless instructions ("always output valid JSON, no trailing commas...").
Result: Massive prompt bloat, hallucinated formats, post-processing parsers, and constant fixes for escaping/order issues.
## What Makes It the Wrong Format
JSON lacks:
- **Namespaces** — no way to mix vocabularies safely.
- **Schemas as first-class contracts** — JSON Schema is optional, lossy, and rarely enforced on wire.
- **Canonicalization** — no standard way to normalize for signing/comparison.
- **Comments** — forbidden, forcing side channels.
- **Mixed content** — text + structure fragile.
It's order-sensitive (objects) but unordered by spec, leading to bugs.
## All the (Really) Bad Things That JSON Brings
- **Escaping hell**: Strings with quotes/newlines/tabs require manual escaping; easy to break.
- **No validation on wire**: Servers trust client JSON — injection attacks common.
- **Order ambiguity**: Objects unordered, arrays ordered — inconsistent, parsing surprises.
- **No self-description**: Types inferred, no built-in schema reference.
- **Brittle AI wiring**: LLMs hallucinate invalid JSON (trailing commas, wrong types) → endless retry loops or custom parsers.
- **Drift prone**: No standard evolution path — schemas change, old clients break silently.
## Red Team Horror Stories from JSON in AI
JSON's flaws turn into nightmares when LLMs are involved — brittle formats meet hallucination-prone models:
- **Hallucinated Invalid JSON**: LLMs routinely add trailing commas, forget quotes, or nest wrong — forcing massive prompt bloat ("You MUST output valid JSON, no trailing commas EVER") and post-processing parsers. One missing brace → entire tool call dropped, reasoning loop broken.
- **Escaping Injection Hell**: User input with quotes/newlines in strings? Escaping fails → malformed JSON → parser crashes or silent data loss. Red teams exploit this for prompt injection (embed control sequences in "safe" strings).
- **Order & Type Ambiguity**: Objects unordered → tools break on key order assumptions. Numbers as strings → type confusion attacks. Arrays of mixed types → validation impossible without custom code.
- **No Tolerance for Dirt**: LLM streams comments or extra text? JSON parsers choke. No repair — whole response rejected, forcing retries and token waste.
- **Real-World Breaks**: Early OpenAI function calling — endless "invalid JSON" errors until prompts became novels. Projects add custom "JSON repair" libraries — admitting the format's fragility.
XML + XSD + repair avoids this entirely: exact contracts enforced on wire, dirty streams tolerated via dummy extraction, no escaping quagmires. The organism stays sovereign.
JSON lost the AI war before it started.
XML (with XSD + C14N) solves these: exact contracts, namespaces, repair tolerance, signing, comments if needed. It's heavier on disk but sovereign on wire — perfect for a bounded organism where security and auditability matter more than minimal bytes.
JSON won the web. XML wins the swarm. Permanently.