Librarian Architecture — RLM-Powered Document Intelligence
Status: Design
Date: January 2026
Overview
The Librarian is an agent that ingests, indexes, and queries large document collections using the Recursive Language Model (RLM) pattern. It can handle codebases, documentation, and structured data at scales far beyond LLM context windows (10M+ tokens).
Key insight from MIT RLM research: Long contexts should be loaded as variables in a REPL environment, not fed directly to the neural network. The LLM writes code to examine, decompose, and recursively query chunks.
Architecture
┌─────────────────────────────────────────────────────────────────┐
│ RLM-Powered Librarian │
│ │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ Ingestion Pipeline │ │
│ │ │ │
│ │ Source → Detect Type → Select Chunker → Index → Store │ │
│ └───────────────────────────────────────────────────────────┘ │
│ │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ Query Engine (RLM Pattern) │ │
│ │ │ │
│ │ Query → Search → Filter → Recursive Sub-Query → Answer │ │
│ └───────────────────────────────────────────────────────────┘ │
│ │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ Storage Layer │ │
│ │ │ │
│ │ eXist-db (XML) + Vector Embeddings + Dependency Graph │ │
│ └───────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
The RLM Pattern
Traditional LLM usage stuffs entire documents into the prompt. This fails at scale:
- Context windows have hard limits (128K-1M tokens)
- Performance degrades with context length ("context rot")
- Cost scales linearly with input size
RLM approach:
- Load as Variable: Documents become references, not inline content
- Programmatic Access: LLM writes code to peek into chunks
- Recursive Sub-Queries: llm_query(chunk, question) for focused analysis
- Aggregation: Combine sub-query results into final answer
# RLM-style pseudocode
async def handle_query(query: str, codebase: CodebaseRef):
    # 1. Search index for relevant chunks (not full content)
    hits = await search_index(codebase, query)

    # 2. Filter if too many results
    if len(hits) > 10:
        hits = await llm_filter(hits, query)  # LLM picks most relevant

    # 3. Recursive sub-queries on each chunk
    findings = []
    for hit in hits:
        chunk = await load_chunk(hit)
        result = await llm_query(
            f"Analyze this for: {query}\n\n{chunk}"
        )
        findings.append(result)

    # 4. Aggregate into final answer
    return await llm_synthesize(findings, query)
Hybrid Chunking Architecture
Chunking is domain-specific. A C++ class should stay together; a legal clause shouldn't be split mid-sentence. We use a hybrid approach:
Built-in Chunkers (Fast Path)
| Chunker | File Types | Strategy | Implementation |
|---|---|---|---|
| Code | .c, .cpp, .py, .js, .rs, ... | AST-aware splitting | tree-sitter |
| Markdown/Docs | .md, .rst, .txt | Heading hierarchy | Custom parser |
| Structured Data | .json, .xml, .yaml | Schema-aware | lxml + json |
| Plain Text | emails, logs, notes | Semantic paragraphs | Sentence boundaries |
These cover ~90% of use cases with optimized, predictable behavior.
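As a concrete example of the fast path, the Markdown heading-hierarchy strategy can be sketched in a few lines. This is a simplification for illustration, not the custom parser's actual implementation:

```python
import re
from dataclasses import dataclass

@dataclass
class Chunk:
    heading: str
    level: int
    content: str

def chunk_markdown(text: str) -> list[Chunk]:
    """Split Markdown into chunks at heading boundaries,
    recording each heading's level for the hierarchy."""
    chunks: list[Chunk] = []
    heading, level, lines = "(preamble)", 0, []
    for line in text.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if m:
            # Close the previous section (skip an empty preamble)
            if lines or level:
                chunks.append(Chunk(heading, level, "\n".join(lines).strip()))
            level, heading, lines = len(m.group(1)), m.group(2), []
        else:
            lines.append(line)
    chunks.append(Chunk(heading, level, "\n".join(lines).strip()))
    return chunks
```

A real chunker would additionally enforce max-chunk-size limits and attach parent/child links, as in the WIT interface below.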
WASM Factory (Fallback for Unknown Types)
For novel formats, the AI generates a custom chunker:
User uploads proprietary format
│
▼
┌───────────────────────────────────────────────────────────┐
│ Step 1: Sample Analysis │
│ │
│ AI examines sample files: │
│ - Structure patterns │
│ - Record boundaries │
│ - Semantic units │
└───────────────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────────┐
│ Step 2: Generate Chunker (Rust → WASM) │
│ │
│ AI writes Rust code implementing the chunker interface │
└───────────────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────────┐
│ Step 3: Compile & Validate │
│ │
│ cargo build --target wasm32-wasi │
│ Test on sample files │
│ AI reviews output quality │
└───────────────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────────┐
│ Step 4: Deploy │
│ │
│ Store in user's WASM modules │
│ Optional: publish to marketplace │
└───────────────────────────────────────────────────────────┘
WASM Chunker Interface (WIT)
// chunker.wit
interface chunker {
    record chunk {
        id: string,
        content: string,
        metadata: list<tuple<string, string>>,
        parent-id: option<string>,
        children: list<string>,
    }

    record chunker-config {
        file-type: string,
        max-chunk-size: u32,
        preserve-context: bool,
        custom-params: list<tuple<string, string>>,
    }

    // Analyze sample data, return chunking config
    analyze: func(sample: string, file-type: string) -> chunker-config

    // Chunk a file using the config
    chunk-file: func(content: string, config: chunker-config) -> list<chunk>
}
Ingestion Pipeline
Step 1: Source Acquisition
@dataclass
class IngestionSource:
    type: Literal["git", "upload", "url", "s3"]
    location: str
    filter: str | None = None  # e.g., "*.cpp", "docs/**/*.md"
Supported sources:
- Git repository: Clone and track branches
- File upload: Direct upload via UI
- URL: Fetch remote documents
- S3/Cloud storage: Enterprise integrations
Step 2: Type Detection
def detect_type(file_path: str, content: bytes) -> FileType:
    # 1. Check extension
    ext = Path(file_path).suffix.lower()
    if ext in CODE_EXTENSIONS:
        return FileType.CODE

    # 2. Check magic bytes
    if content.startswith(b'%PDF'):
        return FileType.PDF

    # 3. Content analysis
    if looks_like_markdown(content):
        return FileType.MARKDOWN

    return FileType.PLAIN_TEXT
Step 3: Chunking
def select_chunker(
    file_type: FileType,
    file_path: str,
    user_config: ChunkerConfig,
) -> Chunker:
    # User override
    if user_config.custom_wasm:
        return WasmChunker(user_config.custom_wasm)

    # Built-in chunkers
    match file_type:
        case FileType.CODE:
            return TreeSitterChunker(language=detect_language(file_path))
        case FileType.MARKDOWN:
            return MarkdownChunker()
        case FileType.JSON | FileType.XML | FileType.YAML:
            return StructuredDataChunker()
        case _:
            return PlainTextChunker()
Step 4: Indexing
Each chunk is indexed in multiple ways:
| Index Type | Purpose | Implementation |
|---|---|---|
| Full-text | Keyword search | eXist-db Lucene |
| Vector | Semantic similarity | Embeddings (OpenAI/local) |
| Graph | Relationships | Class hierarchy, imports, references |
| Metadata | Filtering | File path, type, timestamp |
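Results from these indexes need to be merged into a single ranking. Reciprocal rank fusion (RRF) is one common, parameter-light way to do that; the constant k=60 is conventional, not specified by this design:

```python
from collections import defaultdict

def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked chunk-id lists via reciprocal rank fusion:
    score(id) = sum over rankings of 1 / (k + rank).
    Chunks that rank well in multiple indexes rise to the top."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF needs no score normalization across indexes, which matters here because Lucene scores, cosine similarities, and graph distances are not on a common scale.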
Step 5: Storage
<!-- Chunk stored in eXist-db -->
<chunk xmlns="https://bloxserver.io/ns/librarian/v1">
  <id>opencascade:BRepBuilderAPI_MakeEdge:constructor_1</id>
  <source>
    <repo>opencascade</repo>
    <path>src/BRepBuilderAPI/BRepBuilderAPI_MakeEdge.cxx</path>
    <lines start="42" end="87"/>
  </source>
  <type>function</type>
  <metadata>
    <class>BRepBuilderAPI_MakeEdge</class>
    <visibility>public</visibility>
    <params>const TopoDS_Vertex&amp;, const TopoDS_Vertex&amp;</params>
  </metadata>
  <content><![CDATA[
    BRepBuilderAPI_MakeEdge::BRepBuilderAPI_MakeEdge(
        const TopoDS_Vertex& V1,
        const TopoDS_Vertex& V2)
    {
      // ... implementation
    }
  ]]></content>
  <embedding>[0.023, -0.041, 0.089, ...]</embedding>
</chunk>
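Writing a chunk like the one above reduces to serializing the record and PUTting it to eXist-db's REST interface. The builder below covers only a subset of the schema, and the collection path in the comment is illustrative, not part of the design:

```python
import xml.etree.ElementTree as ET

NS = "https://bloxserver.io/ns/librarian/v1"

def build_chunk_xml(chunk_id: str, repo: str, path: str,
                    start: int, end: int, content: str) -> bytes:
    """Serialize a chunk record in the schema above (metadata and
    embedding elements omitted for brevity)."""
    ET.register_namespace("", NS)  # emit a default xmlns, no prefix
    root = ET.Element(f"{{{NS}}}chunk")
    ET.SubElement(root, f"{{{NS}}}id").text = chunk_id
    source = ET.SubElement(root, f"{{{NS}}}source")
    ET.SubElement(source, f"{{{NS}}}repo").text = repo
    ET.SubElement(source, f"{{{NS}}}path").text = path
    ET.SubElement(source, f"{{{NS}}}lines",
                  {"start": str(start), "end": str(end)})
    ET.SubElement(root, f"{{{NS}}}content").text = content
    return ET.tostring(root, encoding="utf-8")

# Storing the document is then a single HTTP PUT against eXist-db's
# REST API (shown with the requests library; the URL layout mirrors
# the collection tree described under Storage Layer):
#   requests.put(
#       f"{EXIST_URL}/rest/db/librarian/collections/{user}/{coll}"
#       f"/chunks/{chunk_id}.xml",
#       data=doc, headers={"Content-Type": "application/xml"},
#       auth=(exist_user, exist_password))
```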
Query Engine
Query Flow
User: "How does BRepBuilderAPI_MakeEdge handle degenerate curves?"
│
▼
┌───────────────────────────────────────────────────────────┐
│ Step 1: Search │
│ │
│ - Vector search: find semantically similar chunks │
│ - Keyword search: "BRepBuilderAPI_MakeEdge" + "degenerate"│
│ - Graph traversal: class hierarchy, method calls │
│ │
│ Result: 47 potentially relevant chunks │
└───────────────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────────┐
│ Step 2: Filter (LLM-assisted) │
│ │
│ Too many chunks for direct analysis. │
│ LLM reviews summaries, picks top 8 most relevant. │
│ │
│ Selected: │
│ - BRepBuilderAPI_MakeEdge constructors (3 chunks) │
│ - Edge validation methods (2 chunks) │
│ - Degenerate curve handling (2 chunks) │
│ - Error reporting (1 chunk) │
└───────────────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────────┐
│ Step 3: Recursive Sub-Queries │
│ │
│ For each chunk, focused LLM query: │
│ │
│ llm_query(chunk_1, "How does this handle degenerate...") │
│ llm_query(chunk_2, "What validation happens here...") │
│ llm_query(chunk_3, "What errors are raised for...") │
│ ... │
│ │
│ 8 parallel sub-queries → 8 focused findings │
└───────────────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────────┐
│ Step 4: Synthesize │
│ │
│ LLM combines findings into coherent answer: │
│ │
│ "BRepBuilderAPI_MakeEdge handles degenerate curves by: │
│ 1. Checking curve bounds in the constructor... │
│ 2. Calling BRepCheck_Edge for validation... │
│ 3. Setting myError to BRepBuilderAPI_CurveTooSmall..." │
└───────────────────────────────────────────────────────────┘
Handler Implementation
@xmlify
@dataclass
class LibrarianQuery:
    """Query the librarian for information."""
    collection: str           # Which indexed collection
    question: str             # Natural language question
    max_chunks: int = 10      # Limit for recursive queries
    include_sources: bool = True


@xmlify
@dataclass
class LibrarianResponse:
    """Response from librarian with sources."""
    answer: str
    sources: list[SourceReference]
    confidence: float


async def handle_librarian_query(
    payload: LibrarianQuery,
    metadata: HandlerMetadata,
) -> HandlerResponse:
    """RLM-style query handler."""
    # 1. Search for relevant chunks
    hits = await search_collection(
        payload.collection,
        payload.question,
        limit=50,  # Cast wide net
    )

    # 2. Filter if needed
    if len(hits) > payload.max_chunks:
        hits = await llm_filter_chunks(
            hits,
            payload.question,
            limit=payload.max_chunks,
        )

    # 3. Recursive sub-queries
    findings = await asyncio.gather(*[
        llm_analyze_chunk(chunk, payload.question)
        for chunk in hits
    ])

    # 4. Synthesize answer
    answer = await llm_synthesize(findings, payload.question)

    # 5. Build response
    sources = [
        SourceReference(
            path=hit.source_path,
            lines=(hit.start_line, hit.end_line),
            relevance=hit.score,
        )
        for hit in hits
    ]
    return HandlerResponse.respond(
        payload=LibrarianResponse(
            answer=answer,
            sources=sources if payload.include_sources else [],
            confidence=calculate_confidence(findings),
        )
    )
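`calculate_confidence` is used but not defined above. One plausible heuristic (an assumption, not part of the design, including the finding shape it expects) weights sub-query success by each chunk's search relevance:

```python
def calculate_confidence(findings: list[dict]) -> float:
    """Relevance-weighted fraction of sub-queries that produced a
    substantive finding. Assumes each finding looks like
    {"answered": bool, "relevance": float} -- a hypothetical shape."""
    if not findings:
        return 0.0
    total = sum(f["relevance"] for f in findings)
    if total == 0:
        return 0.0
    hit = sum(f["relevance"] for f in findings if f["answered"])
    return hit / total
```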
Storage Layer
eXist-db (Primary Store)
XML-native database for chunk storage and XQuery retrieval.
Why eXist-db:
- Native XQuery for complex queries
- Full-text search with Lucene
- XML validation against schemas
- Transactional updates
Collections structure:
/db/librarian/
├── collections/
│ ├── {user_id}/
│ │ ├── {collection_id}/
│ │ │ ├── metadata.xml
│ │ │ ├── chunks/
│ │ │ │ ├── chunk_001.xml
│ │ │ │ ├── chunk_002.xml
│ │ │ │ └── ...
│ │ │ └── index/
│ │ │ └── embeddings.bin
Vector Embeddings
For semantic search, chunks are embedded using:
- OpenAI text-embedding-3-small (cloud)
- Sentence Transformers (local/self-hosted)
Embeddings are stored alongside chunks, or in a dedicated vector DB (Qdrant/Pinecone) at larger scales.
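At its core, the vector index is nearest-neighbor search by cosine similarity. A dependency-free sketch (brute force, adequate for small collections):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query: list[float],
          chunks: dict[str, list[float]], k: int = 5) -> list[str]:
    """Return the ids of the k chunks most similar to the query."""
    ranked = sorted(chunks, key=lambda cid: cosine(query, chunks[cid]),
                    reverse=True)
    return ranked[:k]
```

At the medium and large tiers in the scaling table, this scan would be replaced by the vector DB's own approximate nearest-neighbor search.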
Dependency Graph
For code collections, track relationships:
- Class hierarchy: inheritance, interfaces
- Imports: file dependencies
- Call graph: function → function references
Stored in eXist-db as XML, or in an external graph DB for complex traversals.
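For illustration, the imports edge type can be extracted from Python sources with the stdlib ast module; real code collections would use a per-language parser such as tree-sitter, as in the chunkers table:

```python
import ast
from collections import defaultdict

def import_graph(files: dict[str, str]) -> dict[str, set[str]]:
    """Map each module (keyed by path) to the set of modules it
    imports. files: {path: source_code}."""
    graph: dict[str, set[str]] = defaultdict(set)
    for path, source in files.items():
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.Import):
                graph[path].update(alias.name for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                graph[path].add(node.module)
    return dict(graph)
```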
Configuration
organism.yaml
listeners:
  - name: librarian
    handler: xml_pipeline.tools.librarian.handle_librarian_query
    payload_class: xml_pipeline.tools.librarian.LibrarianQuery
    description: Query indexed document collections
    agent: true
    peers: []  # Terminal handler
    config:
      exist_db:
        url: "http://localhost:8080/exist"
        user_env: EXIST_USER
        password_env: EXIST_PASSWORD
      embeddings:
        provider: openai  # or "local"
        model: text-embedding-3-small
      chunkers:
        code:
          max_chunk_size: 2000
          overlap: 200
        markdown:
          split_on_headings: true
          min_heading_level: 2
Ingestion API
# Ingest a git repository
await librarian.ingest(
    source=GitSource(
        url="https://github.com/Open-Cascade-SAS/OCCT",
        branch="master",
        filter="src/**/*.cxx",
    ),
    collection="opencascade",
    chunker_config=CodeChunkerConfig(
        language="cpp",
        max_chunk_size=2000,
    ),
)

# Query the collection
response = await librarian.query(
    collection="opencascade",
    question="How does BRepBuilderAPI_MakeEdge handle curves?",
)
Scaling Considerations
| Scale | Storage | Search | Compute |
|---|---|---|---|
| Small (<10K chunks) | eXist-db local | In-DB Lucene | Single node |
| Medium (10K-1M) | eXist-db cluster | + Vector DB | Multi-worker |
| Large (1M+) | Sharded storage | Distributed search | GPU embeddings |
Security
- Collection isolation: Users can only query their own collections
- WASM sandbox: Custom chunkers run in isolated WASM runtime
- Rate limiting: Prevent abuse of recursive queries
- Audit logging: Track all queries for compliance
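The rate-limiting bullet could be realized per user with a token bucket gating the recursive sub-query fan-out; the rate and burst numbers below are illustrative, not part of the design:

```python
import time

class TokenBucket:
    """Allow `rate` recursive sub-queries per second with bursts up
    to `capacity`; call allow() before each llm_query fan-out.
    The injectable clock exists only to make the sketch testable."""

    def __init__(self, rate: float, capacity: float,
                 now=time.monotonic):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.now = capacity, now
        self.last = now()

    def allow(self, cost: float = 1.0) -> bool:
        # Refill proportionally to elapsed time, capped at capacity
        t = self.now()
        self.tokens = min(self.capacity,
                          self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```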
Future Enhancements
- Incremental updates: Re-index only changed files
- Cross-collection queries: Search across multiple codebases
- Collaborative collections: Shared team libraries
- Query caching: Cache common sub-queries
- Streaming ingestion: Real-time updates from git webhooks
References
- Recursive Language Models (MIT) — Foundational research on RLM pattern
- tree-sitter — AST-aware code parsing
- eXist-db — XML-native database
- BloxServer Architecture — Platform overview