Librarian Architecture — RLM-Powered Document Intelligence
Status: Design
Date: January 2026
Overview
The Librarian is an agent that ingests, indexes, and queries large document collections using the Recursive Language Model (RLM) pattern. It can handle codebases, documentation, and structured data at scales far beyond LLM context windows (10M+ tokens).
Key insight from MIT RLM research: Long contexts should be loaded as variables in a REPL environment, not fed directly to the neural network. The LLM writes code to examine, decompose, and recursively query chunks.
Architecture
┌─────────────────────────────────────────────────────────────────┐
│ RLM-Powered Librarian │
│ │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ Ingestion Pipeline │ │
│ │ │ │
│ │ Source → Detect Type → Select Chunker → Index → Store │ │
│ └───────────────────────────────────────────────────────────┘ │
│ │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ Query Engine (RLM Pattern) │ │
│ │ │ │
│ │ Query → Search → Filter → Recursive Sub-Query → Answer │ │
│ └───────────────────────────────────────────────────────────┘ │
│ │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ Storage Layer │ │
│ │ │ │
│ │ eXist-db (XML) + Vector Embeddings + Dependency Graph │ │
│ └───────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
The RLM Pattern
Traditional LLM usage stuffs entire documents into the prompt. This fails at scale:
- Context windows have hard limits (128K-1M tokens)
- Performance degrades with context length ("context rot")
- Cost scales linearly with input size
RLM approach:
- Load as Variable: Documents become references, not inline content
- Programmatic Access: LLM writes code to peek into chunks
- Recursive Sub-Queries: llm_query(chunk, question) for focused analysis
- Aggregation: Combine sub-query results into final answer
# RLM-style pseudocode
async def handle_query(query: str, codebase: CodebaseRef):
    # 1. Search index for relevant chunks (not full content)
    hits = await search_index(codebase, query)

    # 2. Filter if too many results
    if len(hits) > 10:
        hits = await llm_filter(hits, query)  # LLM picks most relevant

    # 3. Recursive sub-queries on each chunk
    findings = []
    for hit in hits:
        chunk = await load_chunk(hit)
        result = await llm_query(
            f"Analyze this for: {query}\n\n{chunk}"
        )
        findings.append(result)

    # 4. Aggregate into final answer
    return await llm_synthesize(findings, query)
Hybrid Chunking Architecture
Chunking is domain-specific. A C++ class should stay together; a legal clause shouldn't be split mid-sentence. We use a hybrid approach:
Built-in Chunkers (Fast Path)
| Chunker | File Types | Strategy | Implementation |
|---|---|---|---|
| Code | .c, .cpp, .py, .js, .rs, ... | AST-aware splitting | tree-sitter |
| Markdown/Docs | .md, .rst, .txt | Heading hierarchy | Custom parser |
| Structured Data | .json, .xml, .yaml | Schema-aware | lxml + json |
| Plain Text | emails, logs, notes | Semantic paragraphs | Sentence boundaries |
These cover ~90% of use cases with optimized, predictable behavior.
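As a concrete example of the fast path, the Markdown heading-hierarchy strategy can be sketched in a few lines. This is a simplification for illustration, not the custom parser's actual implementation:

```python
import re
from dataclasses import dataclass

@dataclass
class Chunk:
    heading: str
    level: int
    content: str

def chunk_markdown(text: str) -> list[Chunk]:
    """Split Markdown into chunks at heading boundaries,
    recording each heading's level for the hierarchy."""
    chunks: list[Chunk] = []
    heading, level, lines = "(preamble)", 0, []
    for line in text.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if m:
            # Close the previous section (skip an empty preamble)
            if lines or level:
                chunks.append(Chunk(heading, level, "\n".join(lines).strip()))
            level, heading, lines = len(m.group(1)), m.group(2), []
        else:
            lines.append(line)
    chunks.append(Chunk(heading, level, "\n".join(lines).strip()))
    return chunks
```

A real chunker would additionally enforce max-chunk-size limits and attach parent/child links, as in the WIT interface below.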
WASM Factory (Fallback for Unknown Types)
For novel formats, the AI generates a custom chunker:
User uploads proprietary format
│
▼
┌───────────────────────────────────────────────────────────┐
│ Step 1: Sample Analysis │
│ │
│ AI examines sample files: │
│ - Structure patterns │
│ - Record boundaries │
│ - Semantic units │
└───────────────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────────┐
│ Step 2: Generate Chunker (Rust → WASM) │
│ │
│ AI writes Rust code implementing the chunker interface │
└───────────────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────────┐
│ Step 3: Compile & Validate │
│ │
│ cargo build --target wasm32-wasi │
│ Test on sample files │
│ AI reviews output quality │
└───────────────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────────┐
│ Step 4: Deploy │
│ │
│ Store in user's WASM modules │
│ Optional: publish to marketplace │
└───────────────────────────────────────────────────────────┘
WASM Chunker Interface (WIT)
// chunker.wit
interface chunker {
    record chunk {
        id: string,
        content: string,
        metadata: list<tuple<string, string>>,
        parent-id: option<string>,
        children: list<string>,
    }

    record chunker-config {
        file-type: string,
        max-chunk-size: u32,
        preserve-context: bool,
        custom-params: list<tuple<string, string>>,
    }

    // Analyze sample data, return chunking config
    analyze: func(sample: string, file-type: string) -> chunker-config

    // Chunk a file using the config
    chunk-file: func(content: string, config: chunker-config) -> list<chunk>
}
Ingestion Pipeline
Step 1: Source Acquisition
@dataclass
class IngestionSource:
    type: Literal["git", "upload", "url", "s3"]
    location: str
    filter: str | None = None  # e.g., "*.cpp", "docs/**/*.md"
Supported sources:
- Git repository: Clone and track branches
- File upload: Direct upload via UI
- URL: Fetch remote documents
- S3/Cloud storage: Enterprise integrations
Step 2: Type Detection
def detect_type(file_path: str, content: bytes) -> FileType:
    # 1. Check extension
    ext = Path(file_path).suffix.lower()
    if ext in CODE_EXTENSIONS:
        return FileType.CODE

    # 2. Check magic bytes
    if content.startswith(b'%PDF'):
        return FileType.PDF

    # 3. Content analysis
    if looks_like_markdown(content):
        return FileType.MARKDOWN

    return FileType.PLAIN_TEXT
Step 3: Chunking
def select_chunker(
    file_type: FileType,
    file_path: str,
    user_config: ChunkerConfig,
) -> Chunker:
    # User override
    if user_config.custom_wasm:
        return WasmChunker(user_config.custom_wasm)

    # Built-in chunkers
    match file_type:
        case FileType.CODE:
            return TreeSitterChunker(language=detect_language(file_path))
        case FileType.MARKDOWN:
            return MarkdownChunker()
        case FileType.JSON | FileType.XML | FileType.YAML:
            return StructuredDataChunker()
        case _:
            return PlainTextChunker()
Step 4: Indexing
Each chunk is indexed in multiple ways:
| Index Type | Purpose | Implementation |
|---|---|---|
| Full-text | Keyword search | eXist-db Lucene |
| Vector | Semantic similarity | Embeddings (OpenAI/local) |
| Graph | Relationships | Class hierarchy, imports, references |
| Metadata | Filtering | File path, type, timestamp |
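Results from these indexes need to be merged into a single ranking. Reciprocal rank fusion (RRF) is one common, parameter-light way to do that; the constant k=60 is conventional, not specified by this design:

```python
from collections import defaultdict

def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked chunk-id lists via reciprocal rank fusion:
    score(id) = sum over rankings of 1 / (k + rank).
    Chunks that rank well in multiple indexes rise to the top."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF needs no score normalization across indexes, which matters here because Lucene scores, cosine similarities, and graph distances are not on a common scale.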
Step 5: Storage
<!-- Chunk stored in eXist-db -->
<chunk xmlns="https://bloxserver.io/ns/librarian/v1">
  <id>opencascade:BRepBuilderAPI_MakeEdge:constructor_1</id>
  <source>
    <repo>opencascade</repo>
    <path>src/BRepBuilderAPI/BRepBuilderAPI_MakeEdge.cxx</path>
    <lines start="42" end="87"/>
  </source>
  <type>function</type>
  <metadata>
    <class>BRepBuilderAPI_MakeEdge</class>
    <visibility>public</visibility>
    <params>const TopoDS_Vertex&amp;, const TopoDS_Vertex&amp;</params>
  </metadata>
  <content><![CDATA[
    BRepBuilderAPI_MakeEdge::BRepBuilderAPI_MakeEdge(
        const TopoDS_Vertex& V1,
        const TopoDS_Vertex& V2)
    {
      // ... implementation
    }
  ]]></content>
  <embedding>[0.023, -0.041, 0.089, ...]</embedding>
</chunk>
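Writing a chunk like the one above reduces to serializing the record and PUTting it to eXist-db's REST interface. The builder below covers only a subset of the schema, and the collection path in the comment is illustrative, not part of the design:

```python
import xml.etree.ElementTree as ET

NS = "https://bloxserver.io/ns/librarian/v1"

def build_chunk_xml(chunk_id: str, repo: str, path: str,
                    start: int, end: int, content: str) -> bytes:
    """Serialize a chunk record in the schema above (metadata and
    embedding elements omitted for brevity)."""
    ET.register_namespace("", NS)  # emit a default xmlns, no prefix
    root = ET.Element(f"{{{NS}}}chunk")
    ET.SubElement(root, f"{{{NS}}}id").text = chunk_id
    source = ET.SubElement(root, f"{{{NS}}}source")
    ET.SubElement(source, f"{{{NS}}}repo").text = repo
    ET.SubElement(source, f"{{{NS}}}path").text = path
    ET.SubElement(source, f"{{{NS}}}lines",
                  {"start": str(start), "end": str(end)})
    ET.SubElement(root, f"{{{NS}}}content").text = content
    return ET.tostring(root, encoding="utf-8")

# Storing the document is then a single HTTP PUT against eXist-db's
# REST API (shown with the requests library; the URL layout mirrors
# the collection tree described under Storage Layer):
#   requests.put(
#       f"{EXIST_URL}/rest/db/librarian/collections/{user}/{coll}"
#       f"/chunks/{chunk_id}.xml",
#       data=doc, headers={"Content-Type": "application/xml"},
#       auth=(exist_user, exist_password))
```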
Query Engine
Query Flow
User: "How does BRepBuilderAPI_MakeEdge handle degenerate curves?"
│
▼
┌───────────────────────────────────────────────────────────┐
│ Step 1: Search │
│ │
│ - Vector search: find semantically similar chunks │
│ - Keyword search: "BRepBuilderAPI_MakeEdge" + "degenerate"│
│ - Graph traversal: class hierarchy, method calls │
│ │
│ Result: 47 potentially relevant chunks │
└───────────────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────────┐
│ Step 2: Filter (LLM-assisted) │
│ │
│ Too many chunks for direct analysis. │
│ LLM reviews summaries, picks top 8 most relevant. │
│ │
│ Selected: │
│ - BRepBuilderAPI_MakeEdge constructors (3 chunks) │
│ - Edge validation methods (2 chunks) │
│ - Degenerate curve handling (2 chunks) │
│ - Error reporting (1 chunk) │
└───────────────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────────┐
│ Step 3: Recursive Sub-Queries │
│ │
│ For each chunk, focused LLM query: │
│ │
│ llm_query(chunk_1, "How does this handle degenerate...") │
│ llm_query(chunk_2, "What validation happens here...") │
│ llm_query(chunk_3, "What errors are raised for...") │
│ ... │
│ │
│ 8 parallel sub-queries → 8 focused findings │
└───────────────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────────┐
│ Step 4: Synthesize │
│ │
│ LLM combines findings into coherent answer: │
│ │
│ "BRepBuilderAPI_MakeEdge handles degenerate curves by: │
│ 1. Checking curve bounds in the constructor... │
│ 2. Calling BRepCheck_Edge for validation... │
│ 3. Setting myError to BRepBuilderAPI_CurveTooSmall..." │
└───────────────────────────────────────────────────────────┘
Handler Implementation
@xmlify
@dataclass
class LibrarianQuery:
    """Query the librarian for information."""
    collection: str           # Which indexed collection
    question: str             # Natural language question
    max_chunks: int = 10      # Limit for recursive queries
    include_sources: bool = True


@xmlify
@dataclass
class LibrarianResponse:
    """Response from librarian with sources."""
    answer: str
    sources: list[SourceReference]
    confidence: float


async def handle_librarian_query(
    payload: LibrarianQuery,
    metadata: HandlerMetadata,
) -> HandlerResponse:
    """RLM-style query handler."""
    # 1. Search for relevant chunks
    hits = await search_collection(
        payload.collection,
        payload.question,
        limit=50,  # Cast wide net
    )

    # 2. Filter if needed
    if len(hits) > payload.max_chunks:
        hits = await llm_filter_chunks(
            hits,
            payload.question,
            limit=payload.max_chunks,
        )

    # 3. Recursive sub-queries
    findings = await asyncio.gather(*[
        llm_analyze_chunk(chunk, payload.question)
        for chunk in hits
    ])

    # 4. Synthesize answer
    answer = await llm_synthesize(findings, payload.question)

    # 5. Build response
    sources = [
        SourceReference(
            path=hit.source_path,
            lines=(hit.start_line, hit.end_line),
            relevance=hit.score,
        )
        for hit in hits
    ]
    return HandlerResponse.respond(
        payload=LibrarianResponse(
            answer=answer,
            sources=sources if payload.include_sources else [],
            confidence=calculate_confidence(findings),
        )
    )
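`calculate_confidence` is used but not defined above. One plausible heuristic (an assumption, not part of the design, including the finding shape it expects) weights sub-query success by each chunk's search relevance:

```python
def calculate_confidence(findings: list[dict]) -> float:
    """Relevance-weighted fraction of sub-queries that produced a
    substantive finding. Assumes each finding looks like
    {"answered": bool, "relevance": float} -- a hypothetical shape."""
    if not findings:
        return 0.0
    total = sum(f["relevance"] for f in findings)
    if total == 0:
        return 0.0
    hit = sum(f["relevance"] for f in findings if f["answered"])
    return hit / total
```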
Storage Layer
eXist-db (Primary Store)
XML-native database for chunk storage and XQuery retrieval.
Why eXist-db:
- Native XQuery for complex queries
- Full-text search with Lucene
- XML validation against schemas
- Transactional updates
Collections structure:
/db/librarian/
├── collections/
│ ├── {user_id}/
│ │ ├── {collection_id}/
│ │ │ ├── metadata.xml
│ │ │ ├── chunks/
│ │ │ │ ├── chunk_001.xml
│ │ │ │ ├── chunk_002.xml
│ │ │ │ └── ...
│ │ │ └── index/
│ │ │ └── embeddings.bin
Vector Embeddings
For semantic search, chunks are embedded using:
- OpenAI text-embedding-3-small (cloud)
- Sentence Transformers (local/self-hosted)
Embeddings are stored alongside chunks, or in a dedicated vector DB (Qdrant/Pinecone) at larger scales.
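At its core, the vector index is nearest-neighbor search by cosine similarity. A dependency-free sketch (brute force, adequate for small collections):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query: list[float],
          chunks: dict[str, list[float]], k: int = 5) -> list[str]:
    """Return the ids of the k chunks most similar to the query."""
    ranked = sorted(chunks, key=lambda cid: cosine(query, chunks[cid]),
                    reverse=True)
    return ranked[:k]
```

At the medium and large tiers in the scaling table, this scan would be replaced by the vector DB's own approximate nearest-neighbor search.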
Dependency Graph
For code collections, track relationships:
- Class hierarchy: inheritance, interfaces
- Imports: file dependencies
- Call graph: function → function references
Stored in eXist-db as XML, or in an external graph DB for complex traversals.
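For illustration, the imports edge type can be extracted from Python sources with the stdlib ast module; real code collections would use a per-language parser such as tree-sitter, as in the chunkers table:

```python
import ast
from collections import defaultdict

def import_graph(files: dict[str, str]) -> dict[str, set[str]]:
    """Map each module (keyed by path) to the set of modules it
    imports. files: {path: source_code}."""
    graph: dict[str, set[str]] = defaultdict(set)
    for path, source in files.items():
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.Import):
                graph[path].update(alias.name for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                graph[path].add(node.module)
    return dict(graph)
```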
Configuration
organism.yaml
listeners:
  - name: librarian
    handler: xml_pipeline.tools.librarian.handle_librarian_query
    payload_class: xml_pipeline.tools.librarian.LibrarianQuery
    description: Query indexed document collections
    agent: true
    peers: []  # Terminal handler
    config:
      exist_db:
        url: "http://localhost:8080/exist"
        user_env: EXIST_USER
        password_env: EXIST_PASSWORD
      embeddings:
        provider: openai  # or "local"
        model: text-embedding-3-small
      chunkers:
        code:
          max_chunk_size: 2000
          overlap: 200
        markdown:
          split_on_headings: true
          min_heading_level: 2
Ingestion API
# Ingest a git repository
await librarian.ingest(
    source=GitSource(
        url="https://github.com/Open-Cascade-SAS/OCCT",
        branch="master",
        filter="src/**/*.cxx",
    ),
    collection="opencascade",
    chunker_config=CodeChunkerConfig(
        language="cpp",
        max_chunk_size=2000,
    ),
)

# Query the collection
response = await librarian.query(
    collection="opencascade",
    question="How does BRepBuilderAPI_MakeEdge handle curves?",
)
Scaling Considerations
| Scale | Storage | Search | Compute |
|---|---|---|---|
| Small (<10K chunks) | eXist-db local | In-DB Lucene | Single node |
| Medium (10K-1M) | eXist-db cluster | + Vector DB | Multi-worker |
| Large (1M+) | Sharded storage | Distributed search | GPU embeddings |
Security
- Collection isolation: Users can only query their own collections
- WASM sandbox: Custom chunkers run in isolated WASM runtime
- Rate limiting: Prevent abuse of recursive queries
- Audit logging: Track all queries for compliance
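The rate-limiting bullet could be realized per user with a token bucket gating the recursive sub-query fan-out; the rate and burst numbers below are illustrative, not part of the design:

```python
import time

class TokenBucket:
    """Allow `rate` recursive sub-queries per second with bursts up
    to `capacity`; call allow() before each llm_query fan-out.
    The injectable clock exists only to make the sketch testable."""

    def __init__(self, rate: float, capacity: float,
                 now=time.monotonic):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.now = capacity, now
        self.last = now()

    def allow(self, cost: float = 1.0) -> bool:
        # Refill proportionally to elapsed time, capped at capacity
        t = self.now()
        self.tokens = min(self.capacity,
                          self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```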
Future Enhancements
- Incremental updates: Re-index only changed files
- Cross-collection queries: Search across multiple codebases
- Collaborative collections: Shared team libraries
- Query caching: Cache common sub-queries
- Streaming ingestion: Real-time updates from git webhooks
References
- Recursive Language Models (MIT) — Foundational research on RLM pattern
- tree-sitter — AST-aware code parsing
- eXist-db — XML-native database
- BloxServer Architecture — Platform overview