Premium Librarian — RLM-Powered Codebase Intelligence
Status: Design Spec
Author: Dan & Donna
Date: 2026-01-26
Overview
The Premium Librarian is a Recursive Language Model (RLM) powered tool that can ingest entire codebases or document corpora (millions of tokens) and answer natural-language queries about them.
Unlike traditional code search (grep, ripgrep, Sourcegraph), the Premium Librarian understands structure, relationships, and intent. It can answer questions like:
- "Where are edges calculated in OpenCASCADE?"
- "How does the authentication flow work?"
- "What would break if I changed this interface?"
Architecture
┌─────────────────────────────────────────────────────────────────────┐
│ PREMIUM LIBRARIAN │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────┐ │
│ │ Ingest │───▶│ Chunker │───▶│ eXist-db │ │
│ │ (upload) │ │ (4 types │ │ (versioned storage) │ │
│ │ │ │ + WASM) │ │ │ │
│ └─────────────┘ └─────────────┘ └───────────┬─────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────┐ │
│ │ Query │───▶│ RLM │───▶│ Index / Map │ │
│ │ (natural │ │ Processor │ │ (structure, relations) │ │
│ │ language) │ │ │ │ │ │
│ └─────────────┘ └──────┬──────┘ └─────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ Response │ │
│ │ (answer + │ │
│ │ sources) │ │
│ └─────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
Chunkers
The RLM algorithm needs content-aware chunking. Different content types have different natural boundaries.
Built-in Chunkers
| Chunker | Content Types | Chunking Strategy |
|---|---|---|
| Code | .py, .js, .ts, .cpp, .c, .java, .go, .rs, etc. | Functions, classes, modules. Preserves imports, docstrings, signatures. |
| Prose | .md, .txt, .rst, .adoc | Paragraphs, sections, chapters. Preserves headings, structure. |
| Structured | .yaml, .json, .xml, .toml, .ini | Schema-aware. Preserves hierarchy, keys, nesting. |
| Tabular | .csv, .tsv, .parquet | Row groups with headers. Preserves column semantics. |
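To make the Prose row concrete, a minimal chunker might split markdown on headings while preserving them, as the table describes. This is an illustrative sketch only; `ProseChunk` and `chunkProse` are hypothetical names, not the shipped chunker:

```typescript
// Minimal sketch of the prose chunking strategy: one chunk per heading,
// heading preserved inside the chunk. Hypothetical names, not the real SDK.

interface ProseChunk {
  heading: string;   // the heading line that opens the section
  content: string;   // section body, heading included
  startLine: number; // 1-based line numbers into the source
  endLine: number;
}

function chunkProse(text: string): ProseChunk[] {
  const lines = text.split("\n");
  const chunks: ProseChunk[] = [];
  let current: ProseChunk | null = null;

  for (let i = 0; i < lines.length; i++) {
    const line = lines[i] ?? "";
    if (/^#{1,6}\s/.test(line)) {
      // A heading starts a new chunk; close the previous one.
      if (current) chunks.push(current);
      current = { heading: line, content: line, startLine: i + 1, endLine: i + 1 };
    } else if (current) {
      current.content += "\n" + line;
      current.endLine = i + 1;
    }
    // Text before the first heading is ignored in this sketch.
  }
  if (current) chunks.push(current);
  return chunks;
}
```

A real prose chunker would also track section nesting and chapter boundaries; this shows only the boundary-detection idea.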
Content Type Detection
- File extension mapping (fast path)
- MIME type detection
- Content sniffing (magic bytes, heuristics)
- User override via config
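The four mechanisms above can be sketched as a cascade. The precedence used here (user override first, then extension, then magic bytes, then a prose fallback) is an assumption; the spec lists the mechanisms but not their order:

```typescript
// Hypothetical detection cascade. Precedence is an assumption, not spec.
type ContentType = "code" | "prose" | "structured" | "tabular" | "binary";

const EXTENSION_MAP: Record<string, ContentType> = {
  ".py": "code", ".cpp": "code", ".ts": "code",
  ".md": "prose", ".txt": "prose",
  ".json": "structured", ".yaml": "structured",
  ".csv": "tabular",
};

function detectContentType(
  path: string,
  head: Uint8Array,                            // first bytes of the file
  overrides: Record<string, ContentType> = {}  // from user config
): ContentType {
  // 1. User override wins outright.
  const override = overrides[path];
  if (override) return override;

  // 2. Extension mapping (fast path).
  const ext = path.slice(path.lastIndexOf("."));
  const byExt = EXTENSION_MAP[ext];
  if (byExt) return byExt;

  // 3. Content sniffing: a NUL byte strongly suggests binary data.
  if (head.includes(0)) return "binary";

  // 4. Fallback: treat unknown text as prose.
  return "prose";
}
```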
Custom WASM Chunkers
For specialized formats, users can provide their own chunker:
// chunker.ts (AssemblyScript)
import { Chunk, ChunkMetadata } from "@openblox/chunker-sdk";

export function chunk(content: string, metadata: ChunkMetadata): Chunk[] {
  // Custom logic for proprietary format
  const chunks: Chunk[] = [];
  // Parse your format, emit chunks
  // Each chunk has: content, startLine, endLine, type, name
  return chunks;
}
Compile & upload:
asc chunker.ts -o chunker.wasm
# Upload via API or CLI
Security:
- WASM runs sandboxed (can't escape, can't access filesystem)
- CPU/memory limits enforced
- Chunker is pure function: string in, chunks out
Ingest Pipeline
1. Upload
POST /api/v1/libraries
{
  "name": "opencascade",
  "source": {
    "type": "git",
    "url": "https://github.com/Open-Cascade-SAS/OCCT.git",
    "branch": "master"
  },
  // OR
  "source": {
    "type": "upload",
    "archive": "<base64 tarball>"
  }
}
2. Chunking
For each file:
- Detect content type
- Select chunker (built-in or custom WASM)
- Chunk content
- Store chunks in eXist-db with metadata
Chunk metadata:
{
  "id": "chunk-uuid",
  "library_id": "lib-uuid",
  "file_path": "src/BRepBuilderAPI/BRepBuilderAPI_MakeEdge.cxx",
  "start_line": 142,
  "end_line": 387,
  "type": "function",
  "name": "BRepBuilderAPI_MakeEdge::Build",
  "language": "cpp",
  "imports": ["gp_Pnt", "TopoDS_Edge", "BRepLib"],
  "calls": ["GCPnts_TangentialDeflection", "BRep_Builder"],
  "version": "v7.8.0",
  "indexed_at": "2026-01-26T..."
}
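The metadata record above maps directly onto a typed structure. A sketch of the "store chunks with metadata" step, where `makeChunkRecord` is an illustrative helper (not part of the spec) and `imports`/`calls` are left for the indexing pass to fill in:

```typescript
// Sketch of the chunk metadata record; field names follow the JSON example.
import { randomUUID } from "node:crypto";

interface ChunkRecord {
  id: string;
  library_id: string;
  file_path: string;
  start_line: number;
  end_line: number;
  type: string;        // "function" | "class" | "module" | ...
  name: string;
  language: string;
  imports: string[];
  calls: string[];
  version: string;
  indexed_at: string;  // ISO-8601 timestamp
}

function makeChunkRecord(
  libraryId: string,
  filePath: string,
  chunk: { startLine: number; endLine: number; type: string; name: string },
  language: string,
  version: string
): ChunkRecord {
  return {
    id: randomUUID(),
    library_id: libraryId,
    file_path: filePath,
    start_line: chunk.startLine,
    end_line: chunk.endLine,
    type: chunk.type,
    name: chunk.name,
    language,
    version,
    imports: [], // populated later by the background indexing job
    calls: [],
    indexed_at: new Date().toISOString(),
  };
}
```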
3. Indexing (Background)
After chunking, the RLM builds a structural index:
- Call graph: What calls what
- Type hierarchy: Classes, inheritance
- Module map: How code is organized
- Symbol table: Functions, classes, constants
- Dependency graph: Imports, includes
This runs as a background job (can take hours for large codebases).
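The call graph and the impact-analysis query ("what would break?") follow directly from the per-chunk `calls` metadata. A sketch, assuming that metadata is already extracted; the builder and traversal below are illustrative, not the shipped indexer:

```typescript
// Build forward and reverse call graphs from per-chunk `calls` metadata.
// The reverse graph answers "what would break if I changed X".

interface ChunkCalls { name: string; calls: string[] }

function buildCallGraph(chunks: ChunkCalls[]) {
  const callees = new Map<string, Set<string>>(); // who X calls
  const callers = new Map<string, Set<string>>(); // who calls X
  for (const c of chunks) {
    if (!callees.has(c.name)) callees.set(c.name, new Set());
    for (const target of c.calls) {
      callees.get(c.name)!.add(target);
      if (!callers.has(target)) callers.set(target, new Set());
      callers.get(target)!.add(c.name);
    }
  }
  return { callees, callers };
}

// Impact analysis: every transitive caller of `symbol`.
function impactedBy(symbol: string, callers: Map<string, Set<string>>): Set<string> {
  const seen = new Set<string>();
  const stack = [symbol];
  while (stack.length > 0) {
    const cur = stack.pop()!;
    for (const caller of callers.get(cur) ?? []) {
      if (!seen.has(caller)) { seen.add(caller); stack.push(caller); }
    }
  }
  return seen;
}
```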
Query Pipeline
1. Query
POST /api/v1/libraries/{id}/query
{
  "question": "Where are edges calculated?",
  "max_tokens": 8000,
  "include_sources": true
}
2. RLM Processing
The RLM receives:
You have access to a library "opencascade" with 2.3M lines of C++ code.
Structural index available:
- 4,231 classes
- 47,892 functions
- 12 main modules: BRepBuilderAPI, BRepAlgoAPI, ...
User question: "Where are edges calculated?"
Available tools:
- search(query) → relevant chunks
- get_chunk(id) → full chunk content
- get_structure(path) → module/class structure
- recursive_query(sub_question) → ask yourself about a subset
RLM then:
- Searches for "edge" in symbol table
- Finds BRepBuilderAPI_MakeEdge, BRepAlgo_EdgeConnector, etc.
- Recursively queries: "What does BRepBuilderAPI_MakeEdge do?"
- Retrieves relevant chunks, synthesizes answer
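The recursion at the heart of these steps can be sketched as a loop over the tools above. This is a deliberately simplified, synchronous sketch (real tool and LLM calls would be async), with a made-up `SUBQUERY:` marker standing in for however the model actually signals a recursive_query; the real processor also handles budgeting, ranking, and caching:

```typescript
// Simplified recursive query loop. `Tools` mirrors the search/get_chunk
// tools above; the llm callback and SUBQUERY protocol are assumptions.

interface Tools {
  search(query: string): string[]; // returns chunk ids
  getChunk(id: string): string;    // returns chunk content
}

function rlmQuery(
  question: string,
  tools: Tools,
  llm: (prompt: string) => string,
  depth = 0,
  maxDepth = 3
): string {
  // Retrieve a handful of candidate chunks as context.
  const ids = tools.search(question);
  const context = ids.slice(0, 5).map((id) => tools.getChunk(id));
  const reply = llm(`Question: ${question}\nContext:\n${context.join("\n---\n")}`);

  // The model may emit a sub-question instead of an answer (recursive_query).
  const sub = reply.match(/^SUBQUERY:\s*(.+)$/m);
  if (sub && depth < maxDepth) {
    const subAnswer = rlmQuery(sub[1], tools, llm, depth + 1, maxDepth);
    return llm(`Question: ${question}\nKnown: ${sub[1]} -> ${subAnswer}\nAnswer directly.`);
  }
  return reply;
}
```

The depth limit bounds recursion cost, which is what makes the per-query pricing range below predictable.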
3. Response
{
  "answer": "Edge calculation in OpenCASCADE primarily happens in the BRepBuilderAPI module...",
  "sources": [
    {
      "file": "src/BRepBuilderAPI/BRepBuilderAPI_MakeEdge.cxx",
      "lines": "142-387",
      "relevance": 0.95,
      "snippet": "void BRepBuilderAPI_MakeEdge::Build() { ... }"
    },
    {
      "file": "src/BRepAlgo/BRepAlgo_EdgeConnector.cxx",
      "lines": "89-201",
      "relevance": 0.82,
      "snippet": "..."
    }
  ],
  "tokens_used": 47832,
  "chunks_examined": 127,
  "cost": "$0.34"
}
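On the client side, the `sources` array is ranked by the `relevance` score. A small illustrative helper (not part of the API) that keeps only high-relevance sources, using the field names from the response above:

```typescript
// Client-side helper over the response shape above: filter sources by a
// relevance threshold and sort best-first. Illustrative, not part of the API.

interface QuerySource { file: string; lines: string; relevance: number; snippet: string }

function topSources(sources: QuerySource[], minRelevance = 0.8): QuerySource[] {
  return sources
    .filter((s) => s.relevance >= minRelevance)
    .sort((a, b) => b.relevance - a.relevance);
}
```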
Storage (eXist-db)
Why eXist-db?
- Versioning: Track codebase changes over time
- XML-native: Fits with xml-pipeline philosophy
- XQuery: Powerful querying for structured data
- Efficient: Handles millions of documents
Schema
<library id="opencascade" version="v7.8.0">
  <metadata>
    <name>OpenCASCADE</name>
    <source>git:https://github.com/Open-Cascade-SAS/OCCT.git</source>
    <indexed_at>2026-01-26T08:00:00Z</indexed_at>
    <stats>
      <files>12847</files>
      <lines>2341892</lines>
      <chunks>89234</chunks>
    </stats>
  </metadata>
  <structure>
    <module name="BRepBuilderAPI">
      <class name="BRepBuilderAPI_MakeEdge">
        <function name="Build" file="..." lines="142-387"/>
        ...
      </class>
    </module>
  </structure>
  <chunks>
    <chunk id="..." file="..." type="function" name="Build">
      <content>...</content>
      <relations>
        <calls>GCPnts_TangentialDeflection</calls>
        <imports>gp_Pnt</imports>
      </relations>
    </chunk>
  </chunks>
</library>
Pricing (Premium Tier)
| Operation | Cost |
|---|---|
| Ingest (per 1M tokens) | $2.00 |
| Index (per library) | $5.00 - $50.00 (depends on size) |
| Query (per query) | $0.10 - $2.00 (depends on complexity) |
| Storage (per GB/month) | $0.50 |
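A worked example of the table. Ingest and index are one-time charges, so the figure below is a first-month total; the index fee ($20) and average query cost ($0.50) are illustrative mid-range picks from the published ranges:

```typescript
// First-month cost estimate from the pricing table. Per-unit rates are
// the published ones; index fee and avg query cost are mid-range assumptions.

const INGEST_PER_M_TOKENS = 2.0;  // $ per 1M tokens ingested
const STORAGE_PER_GB_MONTH = 0.5; // $ per GB per month

function firstMonthCost(
  tokensM: number,      // millions of tokens ingested
  indexFee: number,     // one-time, $5-$50 depending on size
  queries: number,
  avgQueryCost: number, // $0.10-$2.00 depending on complexity
  storageGb: number
): number {
  return tokensM * INGEST_PER_M_TOKENS + indexFee
       + queries * avgQueryCost + storageGb * STORAGE_PER_GB_MONTH;
}

// e.g. a 30M-token codebase, $20 index, 100 queries at ~$0.50, 4 GB stored:
// 30*2 + 20 + 100*0.5 + 4*0.5 = $132 for the first month
```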
Why premium:
- Compute-intensive (lots of LLM calls)
- Storage-intensive (versioned codebases)
- High value (saves weeks of manual exploration)
Use Cases
1. Legacy Code Understanding
"I inherited a 500K line Fortran codebase. Help me understand the data flow."
2. API Discovery
"How do I create a NURBS surface with specific knot vectors?"
3. Impact Analysis
"What would break if I deprecated this function?"
4. Onboarding
"Explain the architecture of this codebase to a new developer."
5. Code Review Assistance
"Does this change follow the patterns used elsewhere in the codebase?"
Implementation Phases
Phase 1: MVP
- Basic ingest (git clone, tarball upload)
- Code chunker (Python, JavaScript, C++)
- eXist-db storage
- Simple RLM query (search + retrieve)
Phase 2: Full Chunkers
- Prose chunker
- Structured chunker (YAML/JSON/XML)
- Tabular chunker
- WASM chunker SDK
Phase 3: Deep Indexing
- Call graph extraction
- Type hierarchy
- Cross-reference index
- Incremental re-indexing on changes
Phase 4: Advanced Queries
- Multi-turn conversations about code
- "What if" analysis
- Code generation informed by codebase patterns
"Finally understand the codebase you inherited."