# Librarian Architecture — RLM-Powered Document Intelligence
**Status:** Design
**Date:** January 2026
## Overview
The Librarian is an agent that ingests, indexes, and queries large document collections using the **Recursive Language Model (RLM)** pattern. It can handle codebases, documentation, and structured data at scales far beyond LLM context windows (10M+ tokens).
Key insight from [MIT RLM research](https://arxiv.org/abs/...): Long contexts should be loaded as **variables in a REPL environment**, not fed directly to the neural network. The LLM writes code to examine, decompose, and recursively query chunks.
## Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│ RLM-Powered Librarian │
│ │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ Ingestion Pipeline │ │
│ │ │ │
│ │ Source → Detect Type → Select Chunker → Index → Store │ │
│ └───────────────────────────────────────────────────────────┘ │
│ │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ Query Engine (RLM Pattern) │ │
│ │ │ │
│ │ Query → Search → Filter → Recursive Sub-Query → Answer │ │
│ └───────────────────────────────────────────────────────────┘ │
│ │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ Storage Layer │ │
│ │ │ │
│ │ eXist-db (XML) + Vector Embeddings + Dependency Graph │ │
│ └───────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
```
## The RLM Pattern
Traditional LLM usage stuffs entire documents into the prompt. This fails at scale:
- Context windows have hard limits (128K-1M tokens)
- Performance degrades with context length ("context rot")
- Cost scales linearly with input size
**RLM approach:**
1. **Load as Variable**: Documents become references, not inline content
2. **Programmatic Access**: LLM writes code to peek into chunks
3. **Recursive Sub-Queries**: `llm_query(chunk, question)` for focused analysis
4. **Aggregation**: Combine sub-query results into final answer
```python
# RLM-style pseudocode
async def handle_query(query: str, codebase: CodebaseRef):
    # 1. Search index for relevant chunks (not full content)
    hits = await search_index(codebase, query)

    # 2. Filter if too many results
    if len(hits) > 10:
        hits = await llm_filter(hits, query)  # LLM picks most relevant

    # 3. Recursive sub-queries on each chunk
    findings = []
    for hit in hits:
        chunk = await load_chunk(hit)
        result = await llm_query(
            f"Analyze this for: {query}\n\n{chunk}"
        )
        findings.append(result)

    # 4. Aggregate into final answer
    return await llm_synthesize(findings, query)
```
## Hybrid Chunking Architecture
Chunking is domain-specific. A C++ class should stay together; a legal clause shouldn't be split mid-sentence. We use a hybrid approach:
### Built-in Chunkers (Fast Path)
| Chunker | File Types | Strategy | Implementation |
|---------|------------|----------|----------------|
| **Code** | .c, .cpp, .py, .js, .rs, ... | AST-aware splitting | tree-sitter |
| **Markdown/Docs** | .md, .rst, .txt | Heading hierarchy | Custom parser |
| **Structured Data** | .json, .xml, .yaml | Schema-aware | lxml + json |
| **Plain Text** | emails, logs, notes | Semantic paragraphs | Sentence boundaries |
These cover ~90% of use cases with optimized, predictable behavior.
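As a sketch of the fast path, a minimal heading-hierarchy chunker for markdown might look like the following. The `Chunk` shape here is illustrative only, not the production chunker interface:

```python
import re
from dataclasses import dataclass, field


@dataclass
class Chunk:
    heading: str
    level: int
    lines: list[str] = field(default_factory=list)


def chunk_markdown(text: str, min_level: int = 2) -> list[Chunk]:
    """Split markdown into chunks at headings of min_level or shallower.

    Deeper headings stay inside their parent chunk, preserving context.
    """
    chunks: list[Chunk] = [Chunk(heading="(preamble)", level=0)]
    for line in text.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if m and len(m.group(1)) <= min_level:
            # A shallow-enough heading starts a new chunk
            chunks.append(Chunk(heading=m.group(2), level=len(m.group(1))))
        else:
            chunks[-1].lines.append(line)
    # Drop the preamble chunk if nothing preceded the first heading
    return [c for c in chunks if c.lines or c.level > 0]
```

A real implementation would also enforce `max_chunk_size` and record parent/child links, as in the WASM interface below.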
### WASM Factory (Fallback for Unknown Types)
For novel formats, the AI generates a custom chunker:
```
User uploads proprietary format
┌───────────────────────────────────────────────────────────┐
│ Step 1: Sample Analysis │
│ │
│ AI examines sample files: │
│ - Structure patterns │
│ - Record boundaries │
│ - Semantic units │
└───────────────────────────────────────────────────────────┘
┌───────────────────────────────────────────────────────────┐
│ Step 2: Generate Chunker (Rust → WASM) │
│ │
│ AI writes Rust code implementing the chunker interface │
└───────────────────────────────────────────────────────────┘
┌───────────────────────────────────────────────────────────┐
│ Step 3: Compile & Validate │
│ │
│ cargo build --target wasm32-wasi │
│ Test on sample files │
│ AI reviews output quality │
└───────────────────────────────────────────────────────────┘
┌───────────────────────────────────────────────────────────┐
│ Step 4: Deploy │
│ │
│ Store in user's WASM modules │
│ Optional: publish to marketplace │
└───────────────────────────────────────────────────────────┘
```
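Step 3 above can be sketched as a host-side validation harness. Here `chunk_file` stands in for the compiled WASM export, and the lossless-reconstruction check is one plausible quality gate, not a fixed part of the design:

```python
from typing import Callable


def validate_chunker(
    chunk_file: Callable[[str], list[str]],
    samples: list[str],
    max_chunk_size: int = 2000,
) -> list[str]:
    """Run a candidate chunker over sample files and collect quality issues."""
    issues: list[str] = []
    for i, sample in enumerate(samples):
        chunks = chunk_file(sample)
        if not chunks:
            issues.append(f"sample {i}: no chunks produced")
            continue
        # Lossless check: concatenated chunks must reproduce the input
        if "".join(chunks) != sample:
            issues.append(f"sample {i}: chunks do not reconstruct input")
        for j, c in enumerate(chunks):
            if len(c) > max_chunk_size:
                issues.append(f"sample {i}, chunk {j}: exceeds {max_chunk_size} chars")
    return issues
```

If the harness returns issues, the AI revises the Rust source and the compile-validate loop repeats.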
### WASM Chunker Interface (WIT)
```wit
// chunker.wit
interface chunker {
    record chunk {
        id: string,
        content: string,
        metadata: list<tuple<string, string>>,
        parent-id: option<string>,
        children: list<string>,
    }

    record chunker-config {
        file-type: string,
        max-chunk-size: u32,
        preserve-context: bool,
        custom-params: list<tuple<string, string>>,
    }

    // Analyze sample data, return chunking config
    analyze: func(sample: string, file-type: string) -> chunker-config;

    // Chunk a file using the config
    chunk-file: func(content: string, config: chunker-config) -> list<chunk>;
}
```
## Ingestion Pipeline
### Step 1: Source Acquisition
```python
@dataclass
class IngestionSource:
    type: Literal["git", "upload", "url", "s3"]
    location: str
    filter: str | None = None  # e.g., "*.cpp", "docs/**/*.md"
```
Supported sources:
- **Git repository**: Clone and track branches
- **File upload**: Direct upload via UI
- **URL**: Fetch remote documents
- **S3/Cloud storage**: Enterprise integrations
### Step 2: Type Detection
```python
def detect_type(file_path: str, content: bytes) -> FileType:
    # 1. Check extension
    ext = Path(file_path).suffix.lower()
    if ext in CODE_EXTENSIONS:
        return FileType.CODE

    # 2. Check magic bytes
    if content.startswith(b'%PDF'):
        return FileType.PDF

    # 3. Content analysis
    if looks_like_markdown(content):
        return FileType.MARKDOWN

    return FileType.PLAIN_TEXT
```
### Step 3: Chunking
```python
def select_chunker(file_type: FileType, user_config: ChunkerConfig) -> Chunker:
    # User override
    if user_config.custom_wasm:
        return WasmChunker(user_config.custom_wasm)

    # Built-in chunkers
    match file_type:
        case FileType.CODE:
            return TreeSitterChunker(language=detect_language(file_type))
        case FileType.MARKDOWN:
            return MarkdownChunker()
        case FileType.JSON | FileType.XML | FileType.YAML:
            return StructuredDataChunker()
        case _:
            return PlainTextChunker()
```
### Step 4: Indexing
Each chunk is indexed in multiple ways:
| Index Type | Purpose | Implementation |
|------------|---------|----------------|
| **Full-text** | Keyword search | eXist-db Lucene |
| **Vector** | Semantic similarity | Embeddings (OpenAI/local) |
| **Graph** | Relationships | Class hierarchy, imports, references |
| **Metadata** | Filtering | File path, type, timestamp |
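The document does not fix how results from the separate indexes are merged; reciprocal rank fusion is one common, score-free option for combining full-text, vector, and graph rankings, sketched here:

```python
def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal-rank fusion over ranked chunk-id lists from several indexes.

    Each index contributes 1 / (k + rank + 1) per chunk; chunks ranked
    highly by multiple indexes rise to the top. k=60 is the conventional
    smoothing constant from the RRF literature.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF avoids calibrating incomparable scores (Lucene BM25 vs. cosine similarity) because it uses only rank positions.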
### Step 5: Storage
```xml
<!-- Chunk stored in eXist-db -->
<chunk xmlns="https://xml-pipeline.org/ns/librarian/v1">
  <id>opencascade:BRepBuilderAPI_MakeEdge:constructor_1</id>
  <source>
    <repo>opencascade</repo>
    <path>src/BRepBuilderAPI/BRepBuilderAPI_MakeEdge.cxx</path>
    <lines start="42" end="87"/>
  </source>
  <type>function</type>
  <metadata>
    <class>BRepBuilderAPI_MakeEdge</class>
    <visibility>public</visibility>
    <params>const TopoDS_Vertex&amp;, const TopoDS_Vertex&amp;</params>
  </metadata>
  <content><![CDATA[
BRepBuilderAPI_MakeEdge::BRepBuilderAPI_MakeEdge(
    const TopoDS_Vertex& V1,
    const TopoDS_Vertex& V2)
{
  // ... implementation
}
]]></content>
  <embedding>[0.023, -0.041, 0.089, ...]</embedding>
</chunk>
```
## Query Engine
### Query Flow
```
User: "How does BRepBuilderAPI_MakeEdge handle degenerate curves?"
┌───────────────────────────────────────────────────────────┐
│ Step 1: Search │
│ │
│ - Vector search: find semantically similar chunks │
│ - Keyword search: "BRepBuilderAPI_MakeEdge" + "degenerate"│
│ - Graph traversal: class hierarchy, method calls │
│ │
│ Result: 47 potentially relevant chunks │
└───────────────────────────────────────────────────────────┘
┌───────────────────────────────────────────────────────────┐
│ Step 2: Filter (LLM-assisted) │
│ │
│ Too many chunks for direct analysis. │
│ LLM reviews summaries, picks top 8 most relevant. │
│ │
│ Selected: │
│ - BRepBuilderAPI_MakeEdge constructors (3 chunks) │
│ - Edge validation methods (2 chunks) │
│ - Degenerate curve handling (2 chunks) │
│ - Error reporting (1 chunk) │
└───────────────────────────────────────────────────────────┘
┌───────────────────────────────────────────────────────────┐
│ Step 3: Recursive Sub-Queries │
│ │
│ For each chunk, focused LLM query: │
│ │
│ llm_query(chunk_1, "How does this handle degenerate...") │
│ llm_query(chunk_2, "What validation happens here...") │
│ llm_query(chunk_3, "What errors are raised for...") │
│ ... │
│ │
│ 8 parallel sub-queries → 8 focused findings │
└───────────────────────────────────────────────────────────┘
┌───────────────────────────────────────────────────────────┐
│ Step 4: Synthesize │
│ │
│ LLM combines findings into coherent answer: │
│ │
│ "BRepBuilderAPI_MakeEdge handles degenerate curves by: │
│ 1. Checking curve bounds in the constructor... │
│ 2. Calling BRepCheck_Edge for validation... │
│ 3. Setting myError to BRepBuilderAPI_CurveTooSmall..." │
└───────────────────────────────────────────────────────────┘
```
### Handler Implementation
```python
@xmlify
@dataclass
class LibrarianQuery:
    """Query the librarian for information."""
    collection: str           # Which indexed collection
    question: str             # Natural language question
    max_chunks: int = 10      # Limit for recursive queries
    include_sources: bool = True


@xmlify
@dataclass
class LibrarianResponse:
    """Response from librarian with sources."""
    answer: str
    sources: list[SourceReference]
    confidence: float


async def handle_librarian_query(
    payload: LibrarianQuery,
    metadata: HandlerMetadata,
) -> HandlerResponse:
    """RLM-style query handler."""
    # 1. Search for relevant chunks
    hits = await search_collection(
        payload.collection,
        payload.question,
        limit=50,  # Cast a wide net
    )

    # 2. Filter if needed
    if len(hits) > payload.max_chunks:
        hits = await llm_filter_chunks(
            hits,
            payload.question,
            limit=payload.max_chunks,
        )

    # 3. Recursive sub-queries
    findings = await asyncio.gather(*[
        llm_analyze_chunk(chunk, payload.question)
        for chunk in hits
    ])

    # 4. Synthesize answer
    answer = await llm_synthesize(findings, payload.question)

    # 5. Build response
    sources = [
        SourceReference(
            path=hit.source_path,
            lines=(hit.start_line, hit.end_line),
            relevance=hit.score,
        )
        for hit in hits
    ]
    return HandlerResponse.respond(
        payload=LibrarianResponse(
            answer=answer,
            sources=sources if payload.include_sources else [],
            confidence=calculate_confidence(findings),
        )
    )
```
## Storage Layer
### eXist-db (Primary Store)
XML-native database for chunk storage and XQuery retrieval.
**Why eXist-db:**
- Native XQuery for complex queries
- Full-text search with Lucene
- XML validation against schemas
- Transactional updates
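Chunks can then be retrieved over eXist-db's REST interface, which accepts an XQuery in the `_query` URL parameter. The collection path and namespace below follow the examples in this document; the helper name is illustrative:

```python
import urllib.parse


def build_chunk_query_url(base_url: str, class_name: str) -> str:
    """Build an eXist-db REST URL whose _query parameter carries the XQuery.

    Finds all chunks whose metadata class matches class_name. The result
    would be fetched with an authenticated GET against the returned URL.
    """
    xquery = (
        'declare namespace lib="https://xml-pipeline.org/ns/librarian/v1"; '
        f'//lib:chunk[lib:metadata/lib:class = "{class_name}"]'
    )
    return (
        f"{base_url}/rest/db/librarian/collections?"
        + urllib.parse.urlencode({"_query": xquery})
    )
```

In production the XQuery would be a stored query with the class name passed as a variable, to avoid injection via user input.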
**Collections structure:**
```
/db/librarian/
├── collections/
│ ├── {user_id}/
│ │ ├── {collection_id}/
│ │ │ ├── metadata.xml
│ │ │ ├── chunks/
│ │ │ │ ├── chunk_001.xml
│ │ │ │ ├── chunk_002.xml
│ │ │ │ └── ...
│ │ │ └── index/
│ │ │ └── embeddings.bin
```
### Vector Embeddings
For semantic search, chunks are embedded using:
- OpenAI `text-embedding-3-small` (cloud)
- Sentence Transformers (local/self-hosted)
Embeddings stored alongside chunks or in dedicated vector DB (Qdrant/Pinecone for scale).
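For small collections, brute-force similarity over the stored embeddings is enough before reaching for a dedicated vector DB. A minimal sketch (vector shapes and the `top_k` helper are illustrative):

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def top_k(
    query_vec: list[float],
    chunk_vecs: dict[str, list[float]],
    k: int = 5,
) -> list[str]:
    """Rank chunk ids by embedding similarity to the query."""
    ranked = sorted(
        chunk_vecs,
        key=lambda cid: cosine(query_vec, chunk_vecs[cid]),
        reverse=True,
    )
    return ranked[:k]
```

This linear scan is O(n) per query; past roughly the "Medium" tier in the scaling table, an ANN index (Qdrant/Pinecone) replaces it.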
### Dependency Graph
For code collections, track relationships:
- **Class hierarchy**: inheritance, interfaces
- **Imports**: file dependencies
- **Call graph**: function → function references
Stored in eXist-db as XML or external graph DB for complex traversals.
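Graph traversal for query expansion can be as simple as a bounded BFS over import edges; the adjacency-list representation here is an assumption about how the graph is materialized:

```python
from collections import deque


def related_files(
    imports: dict[str, list[str]],
    start: str,
    max_depth: int = 2,
) -> set[str]:
    """Breadth-first search over the import graph.

    Returns every file reachable from start within max_depth hops,
    useful for pulling in context around a search hit.
    """
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # don't expand beyond the hop budget
        for dep in imports.get(node, []):
            if dep not in seen:
                seen.add(dep)
                queue.append((dep, depth + 1))
    return seen - {start}
```

The same traversal applies to class-hierarchy and call-graph edges; only the edge source changes.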
## Configuration
### organism.yaml
```yaml
listeners:
  - name: librarian
    handler: xml_pipeline.tools.librarian.handle_librarian_query
    payload_class: xml_pipeline.tools.librarian.LibrarianQuery
    description: Query indexed document collections
    agent: true
    peers: []  # Terminal handler
    config:
      exist_db:
        url: "http://localhost:8080/exist"
        user_env: EXIST_USER
        password_env: EXIST_PASSWORD
      embeddings:
        provider: openai  # or "local"
        model: text-embedding-3-small
      chunkers:
        code:
          max_chunk_size: 2000
          overlap: 200
        markdown:
          split_on_headings: true
          min_heading_level: 2
```
### Ingestion API
```python
# Ingest a git repository
await librarian.ingest(
    source=GitSource(
        url="https://github.com/Open-Cascade-SAS/OCCT",
        branch="master",
        filter="src/**/*.cxx",
    ),
    collection="opencascade",
    chunker_config=CodeChunkerConfig(
        language="cpp",
        max_chunk_size=2000,
    ),
)

# Query the collection
response = await librarian.query(
    collection="opencascade",
    question="How does BRepBuilderAPI_MakeEdge handle curves?",
)
```
## Scaling Considerations
| Scale | Storage | Search | Compute |
|-------|---------|--------|---------|
| Small (<10K chunks) | eXist-db local | In-DB Lucene | Single node |
| Medium (10K-1M) | eXist-db cluster | + Vector DB | Multi-worker |
| Large (1M+) | Sharded storage | Distributed search | GPU embeddings |
## Security
- **Collection isolation**: Users can only query their own collections
- **WASM sandbox**: Custom chunkers run in isolated WASM runtime
- **Rate limiting**: Prevent abuse of recursive queries
- **Audit logging**: Track all queries for compliance
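One way to rate-limit recursive queries is a per-user token bucket, where each sub-query spends a token. A minimal sketch; capacity and refill policy are placeholders, not decided values:

```python
import time


class TokenBucket:
    """Per-user budget: each recursive sub-query spends one token."""

    def __init__(self, capacity: int, refill_per_sec: float) -> None:
        self.capacity = capacity
        self.tokens = float(capacity)
        self.refill_per_sec = refill_per_sec
        self.last = time.monotonic()

    def allow(self, cost: int = 1) -> bool:
        """Refill based on elapsed time, then try to spend `cost` tokens."""
        now = time.monotonic()
        self.tokens = min(
            self.capacity,
            self.tokens + (now - self.last) * self.refill_per_sec,
        )
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Charging each fan-out sub-query (Step 3 of the query flow) against the bucket bounds the cost of a single expensive question as well as overall request volume.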
## Future Enhancements
1. **Incremental updates**: Re-index only changed files
2. **Cross-collection queries**: Search across multiple codebases
3. **Collaborative collections**: Shared team libraries
4. **Query caching**: Cache common sub-queries
5. **Streaming ingestion**: Real-time updates from git webhooks
---
## References
- [Recursive Language Models (MIT)](docs/mit-paper.pdf): foundational research on the RLM pattern
- [tree-sitter](https://tree-sitter.github.io/): AST-aware code parsing
- [eXist-db](http://exist-db.org/): XML-native database
- [Core Principles](core-principles-v2.1.md): architecture overview