Librarian Architecture — RLM-Powered Document Intelligence

Status: Design
Date: January 2026

Overview

The Librarian is an agent that ingests, indexes, and queries large document collections using the Recursive Language Model (RLM) pattern. It can handle codebases, documentation, and structured data at scales far beyond LLM context windows (10M+ tokens).

Key insight from MIT RLM research: Long contexts should be loaded as variables in a REPL environment, not fed directly to the neural network. The LLM writes code to examine, decompose, and recursively query chunks.

Architecture

┌─────────────────────────────────────────────────────────────────┐
│  RLM-Powered Librarian                                          │
│                                                                  │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ Ingestion Pipeline                                         │  │
│  │                                                            │  │
│  │  Source → Detect Type → Select Chunker → Index → Store    │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                  │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ Query Engine (RLM Pattern)                                 │  │
│  │                                                            │  │
│  │  Query → Search → Filter → Recursive Sub-Query → Answer   │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                  │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ Storage Layer                                              │  │
│  │                                                            │  │
│  │  eXist-db (XML) + Vector Embeddings + Dependency Graph    │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘

The RLM Pattern

Traditional LLM usage stuffs entire documents into the prompt. This fails at scale:

  • Context windows have hard limits (128K-1M tokens)
  • Performance degrades with context length ("context rot")
  • Cost scales linearly with input size

RLM approach:

  1. Load as Variable: Documents become references, not inline content
  2. Programmatic Access: LLM writes code to peek into chunks
  3. Recursive Sub-Queries: llm_query(chunk, question) for focused analysis
  4. Aggregation: Combine sub-query results into final answer

# RLM-style pseudocode (search_index, llm_filter, load_chunk,
# llm_query, and llm_synthesize are assumed primitives)
async def handle_query(query: str, codebase: CodebaseRef):
    # 1. Search index for relevant chunks (not full content)
    hits = await search_index(codebase, query)

    # 2. Filter if too many results
    if len(hits) > 10:
        hits = await llm_filter(hits, query)  # LLM picks most relevant

    # 3. Recursive sub-queries on each chunk
    findings = []
    for hit in hits:
        chunk = await load_chunk(hit)
        result = await llm_query(
            f"Analyze this for: {query}\n\n{chunk}"
        )
        findings.append(result)

    # 4. Aggregate into final answer
    return await llm_synthesize(findings, query)

Hybrid Chunking Architecture

Chunking is domain-specific. A C++ class should stay together; a legal clause shouldn't be split mid-sentence. We use a hybrid approach:

Built-in Chunkers (Fast Path)

Chunker          File Types                     Strategy              Implementation
Code             .c, .cpp, .py, .js, .rs, ...   AST-aware splitting   tree-sitter
Markdown/Docs    .md, .rst, .txt                Heading hierarchy     Custom parser
Structured Data  .json, .xml, .yaml             Schema-aware          lxml + json
Plain Text       emails, logs, notes            Semantic paragraphs   Sentence boundaries

These cover ~90% of use cases with optimized, predictable behavior.
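
To make the fast path concrete, here is a minimal sketch of AST-aware chunking with the py-tree-sitter bindings. It assumes the tree_sitter and tree_sitter_python packages (0.22+ API); the fixed-size fallback for oversized definitions is an illustrative choice, not the production behavior.

# Minimal sketch of AST-aware chunking with py-tree-sitter.
from tree_sitter import Language, Parser
import tree_sitter_python as tspython

PY_LANGUAGE = Language(tspython.language())
parser = Parser(PY_LANGUAGE)

def chunk_python(source: bytes, max_chunk_size: int = 2000) -> list[str]:
    """Split at top-level definitions so each function/class stays whole."""
    tree = parser.parse(source)
    chunks: list[str] = []
    for node in tree.root_node.children:
        text = source[node.start_byte:node.end_byte].decode()
        if len(text) <= max_chunk_size:
            chunks.append(text)
        else:
            # Oversized definition: fall back to fixed-size windows.
            chunks.extend(
                text[i:i + max_chunk_size]
                for i in range(0, len(text), max_chunk_size)
            )
    return chunks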

WASM Factory (Fallback for Unknown Types)

For novel formats, the AI generates a custom chunker:

User uploads proprietary format
        │
        ▼
┌───────────────────────────────────────────────────────────┐
│ Step 1: Sample Analysis                                    │
│                                                            │
│ AI examines sample files:                                 │
│ - Structure patterns                                      │
│ - Record boundaries                                       │
│ - Semantic units                                          │
└───────────────────────────────────────────────────────────┘
        │
        ▼
┌───────────────────────────────────────────────────────────┐
│ Step 2: Generate Chunker (Rust → WASM)                    │
│                                                            │
│ AI writes Rust code implementing the chunker interface    │
└───────────────────────────────────────────────────────────┘
        │
        ▼
┌───────────────────────────────────────────────────────────┐
│ Step 3: Compile & Validate                                │
│                                                            │
│ cargo build --target wasm32-wasi                          │
│ Test on sample files                                      │
│ AI reviews output quality                                 │
└───────────────────────────────────────────────────────────┘
        │
        ▼
┌───────────────────────────────────────────────────────────┐
│ Step 4: Deploy                                            │
│                                                            │
│ Store in user's WASM modules                              │
│ Optional: publish to marketplace                          │
└───────────────────────────────────────────────────────────┘

WASM Chunker Interface (WIT)

// chunker.wit
interface chunker {
    record chunk {
        id: string,
        content: string,
        metadata: list<tuple<string, string>>,
        parent-id: option<string>,
        children: list<string>,
    }

    record chunker-config {
        file-type: string,
        max-chunk-size: u32,
        preserve-context: bool,
        custom-params: list<tuple<string, string>>,
    }

    // Analyze sample data, return chunking config
    analyze: func(sample: string, file-type: string) -> chunker-config

    // Chunk a file using the config
    chunk-file: func(content: string, config: chunker-config) -> list<chunk>
}

Ingestion Pipeline

Step 1: Source Acquisition

from dataclasses import dataclass
from typing import Literal

@dataclass
class IngestionSource:
    type: Literal["git", "upload", "url", "s3"]
    location: str
    filter: str | None = None  # e.g., "*.cpp", "docs/**/*.md"

Supported sources:

  • Git repository: Clone and track branches (acquisition sketch after this list)
  • File upload: Direct upload via UI
  • URL: Fetch remote documents
  • S3/Cloud storage: Enterprise integrations
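
A hedged sketch of the git acquisition path: shallow-clone into a scratch directory, then collect files matching the optional filter glob. The acquire_git helper name and the glob-based filtering are illustrative, not a fixed API.

# Illustrative acquisition for the "git" source type.
import asyncio
from pathlib import Path

async def acquire_git(source: IngestionSource, workdir: Path) -> list[Path]:
    proc = await asyncio.create_subprocess_exec(
        "git", "clone", "--depth", "1", source.location, str(workdir)
    )
    if await proc.wait() != 0:
        raise RuntimeError(f"clone failed: {source.location}")
    pattern = source.filter or "**/*"
    return [p for p in workdir.glob(pattern) if p.is_file()]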

Step 2: Type Detection

from pathlib import Path

def detect_type(file_path: str, content: bytes) -> FileType:
    # 1. Check extension
    ext = Path(file_path).suffix.lower()
    if ext in CODE_EXTENSIONS:
        return FileType.CODE

    # 2. Check magic bytes
    if content.startswith(b'%PDF'):
        return FileType.PDF

    # 3. Content analysis
    if looks_like_markdown(content):
        return FileType.MARKDOWN

    return FileType.PLAIN_TEXT

Step 3: Chunking

def select_chunker(
    file_type: FileType,
    file_path: str,
    user_config: ChunkerConfig,
) -> Chunker:
    # User override
    if user_config.custom_wasm:
        return WasmChunker(user_config.custom_wasm)

    # Built-in chunkers
    match file_type:
        case FileType.CODE:
            return TreeSitterChunker(language=detect_language(file_path))
        case FileType.MARKDOWN:
            return MarkdownChunker()
        case FileType.JSON | FileType.XML | FileType.YAML:
            return StructuredDataChunker()
        case _:
            return PlainTextChunker()

Step 4: Indexing

Each chunk is indexed in multiple ways:

Index Type  Purpose              Implementation
Full-text   Keyword search       eXist-db Lucene
Vector      Semantic similarity  Embeddings (OpenAI/local)
Graph       Relationships        Class hierarchy, imports, references
Metadata    Filtering            File path, type, timestamp

Step 5: Storage

<!-- Chunk stored in eXist-db -->
<chunk xmlns="https://bloxserver.io/ns/librarian/v1">
  <id>opencascade:BRepBuilderAPI_MakeEdge:constructor_1</id>
  <source>
    <repo>opencascade</repo>
    <path>src/BRepBuilderAPI/BRepBuilderAPI_MakeEdge.cxx</path>
    <lines start="42" end="87"/>
  </source>
  <type>function</type>
  <metadata>
    <class>BRepBuilderAPI_MakeEdge</class>
    <visibility>public</visibility>
    <params>const TopoDS_Vertex&amp;, const TopoDS_Vertex&amp;</params>
  </metadata>
  <content><![CDATA[
BRepBuilderAPI_MakeEdge::BRepBuilderAPI_MakeEdge(
    const TopoDS_Vertex& V1,
    const TopoDS_Vertex& V2)
{
    // ... implementation
}
  ]]></content>
  <embedding>[0.023, -0.041, 0.089, ...]</embedding>
</chunk>
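
Writing such a chunk is a plain REST PUT against eXist-db. A minimal sketch, assuming httpx, basic-auth credentials, and the collection layout shown under Storage Layer; the store_chunk helper name is illustrative.

# Hedged sketch: persist a chunk document via eXist-db's REST API.
import httpx

async def store_chunk(base_url: str, auth: tuple[str, str],
                      collection: str, doc_name: str, chunk_xml: str) -> None:
    # doc_name follows the layout below, e.g. "chunk_001.xml"
    path = f"/rest/db/librarian/collections/{collection}/chunks/{doc_name}"
    async with httpx.AsyncClient(auth=auth) as client:
        resp = await client.put(
            base_url + path,
            content=chunk_xml,
            headers={"Content-Type": "application/xml"},
        )
        resp.raise_for_status()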

Query Engine

Query Flow

User: "How does BRepBuilderAPI_MakeEdge handle degenerate curves?"
                    │
                    ▼
┌───────────────────────────────────────────────────────────┐
│ Step 1: Search                                            │
│                                                           │
│ - Vector search: find semantically similar chunks         │
│ - Keyword search: "BRepBuilderAPI_MakeEdge" + "degenerate"│
│ - Graph traversal: class hierarchy, method calls          │
│                                                           │
│ Result: 47 potentially relevant chunks                    │
└───────────────────────────────────────────────────────────┘
                    │
                    ▼
┌───────────────────────────────────────────────────────────┐
│ Step 2: Filter (LLM-assisted)                             │
│                                                           │
│ Too many chunks for direct analysis.                      │
│ LLM reviews summaries, picks top 8 most relevant.         │
│                                                           │
│ Selected:                                                 │
│ - BRepBuilderAPI_MakeEdge constructors (3 chunks)        │
│ - Edge validation methods (2 chunks)                      │
│ - Degenerate curve handling (2 chunks)                    │
│ - Error reporting (1 chunk)                               │
└───────────────────────────────────────────────────────────┘
                    │
                    ▼
┌───────────────────────────────────────────────────────────┐
│ Step 3: Recursive Sub-Queries                             │
│                                                           │
│ For each chunk, focused LLM query:                        │
│                                                           │
│ llm_query(chunk_1, "How does this handle degenerate...")  │
│ llm_query(chunk_2, "What validation happens here...")     │
│ llm_query(chunk_3, "What errors are raised for...")       │
│ ...                                                       │
│                                                           │
│ 8 parallel sub-queries → 8 focused findings               │
└───────────────────────────────────────────────────────────┘
                    │
                    ▼
┌───────────────────────────────────────────────────────────┐
│ Step 4: Synthesize                                        │
│                                                           │
│ LLM combines findings into coherent answer:               │
│                                                           │
│ "BRepBuilderAPI_MakeEdge handles degenerate curves by:   │
│  1. Checking curve bounds in the constructor...           │
│  2. Calling BRepCheck_Edge for validation...              │
│  3. Setting myError to BRepBuilderAPI_CurveTooSmall..."   │
└───────────────────────────────────────────────────────────┘

Handler Implementation

import asyncio
from dataclasses import dataclass

@xmlify
@dataclass
class LibrarianQuery:
    """Query the librarian for information."""
    collection: str          # Which indexed collection
    question: str            # Natural language question
    max_chunks: int = 10     # Limit for recursive queries
    include_sources: bool = True

@xmlify
@dataclass
class LibrarianResponse:
    """Response from librarian with sources."""
    answer: str
    sources: list[SourceReference]
    confidence: float

async def handle_librarian_query(
    payload: LibrarianQuery,
    metadata: HandlerMetadata
) -> HandlerResponse:
    """RLM-style query handler."""

    # 1. Search for relevant chunks
    hits = await search_collection(
        payload.collection,
        payload.question,
        limit=50  # Cast wide net
    )

    # 2. Filter if needed
    if len(hits) > payload.max_chunks:
        hits = await llm_filter_chunks(
            hits,
            payload.question,
            limit=payload.max_chunks
        )

    # 3. Recursive sub-queries: load each hit's chunk, analyze in parallel
    async def analyze(hit):
        chunk = await load_chunk(hit)
        return await llm_analyze_chunk(chunk, payload.question)

    findings = await asyncio.gather(*[analyze(hit) for hit in hits])

    # 4. Synthesize answer
    answer = await llm_synthesize(findings, payload.question)

    # 5. Build response
    sources = [
        SourceReference(
            path=hit.source_path,
            lines=(hit.start_line, hit.end_line),
            relevance=hit.score
        )
        for hit in hits
    ]

    return HandlerResponse.respond(
        payload=LibrarianResponse(
            answer=answer,
            sources=sources if payload.include_sources else [],
            confidence=calculate_confidence(findings)
        )
    )
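
The helpers above (search_collection, llm_filter_chunks, llm_analyze_chunk, llm_synthesize) are assumed primitives. As one example of their shape, here is a sketch of the filtering step: the LLM sees one-line summaries and replies with indices. The summary field on hits is an assumption for illustration.

# Hedged sketch of llm_filter_chunks: rank by summary, return top hits.
async def llm_filter_chunks(hits, question: str, limit: int):
    listing = "\n".join(
        f"[{i}] {h.source_path}: {h.summary}" for i, h in enumerate(hits)
    )
    reply = await llm_query(
        f"Question: {question}\n\nCandidate chunks:\n{listing}\n\n"
        f"Reply with the indices of the {limit} most relevant, comma-separated."
    )
    indices = [int(s) for s in reply.split(",") if s.strip().isdigit()]
    return [hits[i] for i in indices[:limit] if i < len(hits)]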

Storage Layer

eXist-db (Primary Store)

XML-native database for chunk storage and XQuery retrieval.

Why eXist-db:

  • Native XQuery for complex queries
  • Full-text search with Lucene
  • XML validation against schemas
  • Transactional updates

Collections structure:

/db/librarian/
├── collections/
│   ├── {user_id}/
│   │   ├── {collection_id}/
│   │   │   ├── metadata.xml
│   │   │   ├── chunks/
│   │   │   │   ├── chunk_001.xml
│   │   │   │   ├── chunk_002.xml
│   │   │   │   └── ...
│   │   │   └── index/
│   │   │       └── embeddings.bin
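
A hedged example of XQuery retrieval against this layout, going through eXist's REST _query parameter with httpx; the full-text expression itself is illustrative and assumes a Lucene index on chunk content.

# Illustrative full-text lookup through eXist-db's REST interface.
import httpx

async def find_chunks(base_url: str, auth: tuple[str, str],
                      user_id: str, collection_id: str, term: str) -> str:
    xquery = (
        f"for $c in collection('/db/librarian/collections/"
        f"{user_id}/{collection_id}/chunks')//*:chunk"
        f"[ft:query(.//*:content, '{term}')] return $c/*:id"
    )
    async with httpx.AsyncClient(auth=auth) as client:
        resp = await client.get(f"{base_url}/rest/db", params={"_query": xquery})
        resp.raise_for_status()
        return resp.text  # XML fragment listing matching chunk ids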

Vector Embeddings

For semantic search, chunks are embedded using:

  • OpenAI text-embedding-3-small (cloud)
  • Sentence Transformers (local/self-hosted)

Embeddings are stored alongside chunks or in a dedicated vector DB (Qdrant/Pinecone at scale).
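
The cloud path maps directly onto the openai Python SDK (1.x); a minimal sketch, with batching and retries omitted:

# Minimal embedding call for the cloud path.
from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

async def embed_texts(texts: list[str]) -> list[list[float]]:
    resp = await client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
    )
    return [item.embedding for item in resp.data]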

Dependency Graph

For code collections, track relationships:

  • Class hierarchy: inheritance, interfaces
  • Imports: file dependencies
  • Call graph: function → function references

Stored in eXist-db as XML or external graph DB for complex traversals.
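
For illustration, a file-level import graph for Python sources, built with the standard-library ast module and networkx; C++ extraction would go through tree-sitter instead. The module-naming scheme is an assumption.

# Illustrative dependency-graph extraction for Python files.
import ast
from pathlib import Path
import networkx as nx

def build_import_graph(root: Path) -> nx.DiGraph:
    graph = nx.DiGraph()
    for path in root.rglob("*.py"):
        module = path.relative_to(root).with_suffix("").as_posix().replace("/", ".")
        tree = ast.parse(path.read_text(encoding="utf-8"))
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                for alias in node.names:
                    graph.add_edge(module, alias.name)
            elif isinstance(node, ast.ImportFrom) and node.module:
                graph.add_edge(module, node.module)
    return graph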

Configuration

organism.yaml

listeners:
  - name: librarian
    handler: xml_pipeline.tools.librarian.handle_librarian_query
    payload_class: xml_pipeline.tools.librarian.LibrarianQuery
    description: Query indexed document collections
    agent: true
    peers: []  # Terminal handler
    config:
      exist_db:
        url: "http://localhost:8080/exist"
        user_env: EXIST_USER
        password_env: EXIST_PASSWORD
      embeddings:
        provider: openai  # or "local"
        model: text-embedding-3-small
      chunkers:
        code:
          max_chunk_size: 2000
          overlap: 200
        markdown:
          split_on_headings: true
          min_heading_level: 2

Ingestion API

# Ingest a git repository
await librarian.ingest(
    source=GitSource(
        url="https://github.com/Open-Cascade-SAS/OCCT",
        branch="master",
        filter="src/**/*.cxx"
    ),
    collection="opencascade",
    chunker_config=CodeChunkerConfig(
        language="cpp",
        max_chunk_size=2000
    )
)

# Query the collection
response = await librarian.query(
    collection="opencascade",
    question="How does BRepBuilderAPI_MakeEdge handle curves?"
)

Scaling Considerations

Scale                Storage           Search              Compute
Small (<10K chunks)  eXist-db local    In-DB Lucene        Single node
Medium (10K-1M)      eXist-db cluster  + Vector DB         Multi-worker
Large (1M+)          Sharded storage   Distributed search  GPU embeddings

Security

  • Collection isolation: Users can only query their own collections
  • WASM sandbox: Custom chunkers run in isolated WASM runtime
  • Rate limiting: Prevent abuse of recursive queries (see the sketch after this list)
  • Audit logging: Track all queries for compliance
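
As one possible shape for the rate-limiting piece, a toy token bucket gating recursive sub-queries; illustrative only, not the production limiter.

# Toy token-bucket limiter for recursive sub-queries.
import asyncio
import time

class TokenBucket:
    def __init__(self, rate: float, capacity: float):
        self.rate, self.capacity = rate, capacity
        self.tokens = capacity
        self.updated = time.monotonic()

    async def acquire(self) -> None:
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return
            # Sleep until roughly one token has accumulated.
            await asyncio.sleep((1.0 - self.tokens) / self.rate)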

Future Enhancements

  1. Incremental updates: Re-index only changed files
  2. Cross-collection queries: Search across multiple codebases
  3. Collaborative collections: Shared team libraries
  4. Query caching: Cache common sub-queries
  5. Streaming ingestion: Real-time updates from git webhooks
