# Premium Librarian — RLM-Powered Codebase Intelligence

**Status:** Design Spec
**Author:** Dan & Donna
**Date:** 2026-01-26

## Overview

The Premium Librarian is a tool powered by a Recursive Language Model (RLM) that can ingest entire codebases or document corpora — millions of tokens — and answer natural-language queries about them. Unlike traditional code search (grep, ripgrep, Sourcegraph), the Premium Librarian *understands* structure, relationships, and intent. It can answer questions like:

- "Where are edges calculated in OpenCASCADE?"
- "How does the authentication flow work?"
- "What would break if I changed this interface?"

## Architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│                          PREMIUM LIBRARIAN                          │
│                                                                     │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────────┐  │
│  │   Ingest    │───▶│   Chunker   │───▶│        eXist-db         │  │
│  │  (upload)   │    │  (4 types   │    │   (versioned storage)   │  │
│  │             │    │   + WASM)   │    │                         │  │
│  └─────────────┘    └─────────────┘    └───────────┬─────────────┘  │
│                                                    │                │
│                                                    ▼                │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────────┐  │
│  │    Query    │───▶│     RLM     │───▶│       Index / Map       │  │
│  │  (natural   │    │  Processor  │    │ (structure, relations)  │  │
│  │  language)  │    │             │    │                         │  │
│  └─────────────┘    └──────┬──────┘    └─────────────────────────┘  │
│                            │                                        │
│                            ▼                                        │
│                     ┌─────────────┐                                 │
│                     │  Response   │                                 │
│                     │  (answer +  │                                 │
│                     │  sources)   │                                 │
│                     └─────────────┘                                 │
└─────────────────────────────────────────────────────────────────────┘
```

## Chunkers

The RLM algorithm needs content-aware chunking: different content types have different natural boundaries.

### Built-in Chunkers

| Chunker | Content Types | Chunking Strategy |
|---------|---------------|-------------------|
| **Code** | .py, .js, .ts, .cpp, .c, .java, .go, .rs, etc. | Functions, classes, modules. Preserves imports, docstrings, signatures. |
| **Prose** | .md, .txt, .rst, .adoc | Paragraphs, sections, chapters. Preserves headings, structure. |
| **Structured** | .yaml, .json, .xml, .toml, .ini | Schema-aware. Preserves hierarchy, keys, nesting. |
| **Tabular** | .csv, .tsv, .parquet | Row groups with headers. Preserves column semantics. |

### Content Type Detection

1. File extension mapping (fast path)
2. MIME type detection
3. Content sniffing (magic bytes, heuristics)
4. User override via config

### Custom WASM Chunkers

For specialized formats, users can provide their own chunker:

```typescript
// chunker.ts (AssemblyScript)
import { Chunk, ChunkMetadata } from "@openblox/chunker-sdk";

export function chunk(content: string, metadata: ChunkMetadata): Chunk[] {
  // Custom logic for a proprietary format
  const chunks: Chunk[] = [];
  // Parse your format and emit chunks.
  // Each chunk has: content, startLine, endLine, type, name
  return chunks;
}
```

**Compile & upload:**

```bash
asc chunker.ts -o chunker.wasm
# Upload via API or CLI
```

**Security:**

- WASM runs sandboxed (it cannot escape or access the filesystem)
- CPU and memory limits are enforced
- The chunker is a pure function: string in, chunks out

## Ingest Pipeline

### 1. Upload

```
POST /api/v1/libraries
{
  "name": "opencascade",
  "source": {
    "type": "git",
    "url": "https://github.com/Open-Cascade-SAS/OCCT.git",
    "branch": "master"
  }
  // OR
  "source": { "type": "upload", "archive": "" }
}
```

### 2. Chunking

For each file:

1. Detect content type
2. Select chunker (built-in or custom WASM)
3. Chunk content
4. Store chunks in eXist-db with metadata

**Chunk metadata:**

```json
{
  "id": "chunk-uuid",
  "library_id": "lib-uuid",
  "file_path": "src/BRepBuilderAPI/BRepBuilderAPI_MakeEdge.cxx",
  "start_line": 142,
  "end_line": 387,
  "type": "function",
  "name": "BRepBuilderAPI_MakeEdge::Build",
  "language": "cpp",
  "imports": ["gp_Pnt", "TopoDS_Edge", "BRepLib"],
  "calls": ["GCPnts_TangentialDeflection", "BRep_Builder"],
  "version": "v7.8.0",
  "indexed_at": "2026-01-26T..."
}
```
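The per-file chunking steps above can be sketched as a small dispatch table. This is an illustrative sketch, not the actual implementation: the `selectChunker` helper, the extension map, and the prose fallback are all assumptions, and it only covers the fast-path step of content type detection.

```typescript
// Hypothetical sketch: map a file to one of the four built-in chunkers
// by extension (the fast path). A real implementation would fall through
// to MIME detection, content sniffing, and user overrides.
type ChunkerKind = "code" | "prose" | "structured" | "tabular";

const EXTENSION_MAP: Record<string, ChunkerKind> = {
  ".py": "code", ".js": "code", ".ts": "code",
  ".cpp": "code", ".c": "code", ".cxx": "code", ".hxx": "code",
  ".md": "prose", ".txt": "prose", ".rst": "prose", ".adoc": "prose",
  ".yaml": "structured", ".json": "structured",
  ".xml": "structured", ".toml": "structured", ".ini": "structured",
  ".csv": "tabular", ".tsv": "tabular", ".parquet": "tabular",
};

function selectChunker(filePath: string): ChunkerKind {
  const dot = filePath.lastIndexOf(".");
  const ext = dot >= 0 ? filePath.slice(dot).toLowerCase() : "";
  // Assumed fallback: treat unknown file types as prose.
  return EXTENSION_MAP[ext] ?? "prose";
}
```

For example, `selectChunker("src/BRepBuilderAPI_MakeEdge.cxx")` resolves to the code chunker, while an extensionless file like `Makefile` falls back to prose.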
### 3. Indexing (Background)

After chunking, the RLM builds a structural index:

- **Call graph**: what calls what
- **Type hierarchy**: classes, inheritance
- **Module map**: how the code is organized
- **Symbol table**: functions, classes, constants
- **Dependency graph**: imports, includes

This runs as a background job (it can take hours for large codebases).

## Query Pipeline

### 1. Query

```
POST /api/v1/libraries/{id}/query
{
  "question": "Where are edges calculated?",
  "max_tokens": 8000,
  "include_sources": true
}
```

### 2. RLM Processing

The RLM receives:

```
You have access to a library "opencascade" with 2.3M lines of C++ code.

Structural index available:
- 4,231 classes
- 47,892 functions
- 12 main modules: BRepBuilderAPI, BRepAlgoAPI, ...

User question: "Where are edges calculated?"

Available tools:
- search(query) → relevant chunks
- get_chunk(id) → full chunk content
- get_structure(path) → module/class structure
- recursive_query(sub_question) → ask yourself about a subset
```

The RLM then:

1. Searches for "edge" in the symbol table
2. Finds `BRepBuilderAPI_MakeEdge`, `BRepAlgo_EdgeConnector`, etc.
3. Recursively queries: "What does BRepBuilderAPI_MakeEdge do?"
4. Retrieves the relevant chunks and synthesizes an answer

### 3. Response

```json
{
  "answer": "Edge calculation in OpenCASCADE primarily happens in the BRepBuilderAPI module...",
  "sources": [
    {
      "file": "src/BRepBuilderAPI/BRepBuilderAPI_MakeEdge.cxx",
      "lines": "142-387",
      "relevance": 0.95,
      "snippet": "void BRepBuilderAPI_MakeEdge::Build() { ... }"
    },
    {
      "file": "src/BRepAlgo/BRepAlgo_EdgeConnector.cxx",
      "lines": "89-201",
      "relevance": 0.82,
      "snippet": "..."
    }
  ],
  "tokens_used": 47832,
  "chunks_examined": 127,
  "cost": "$0.34"
}
```
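The `recursive_query` tool is the heart of the RLM loop, so a toy driver may help. This sketch is an assumption-laden illustration, not the production API: the `Tools` interface, the `subQuestions` stand-in for the LLM's decomposition step, and the fixed depth limit are all hypothetical (a real RLM decides dynamically when and how to recurse).

```typescript
// Toy sketch of the recursive_query pattern: answer a question by
// searching for chunks and, below a depth limit, asking sub-questions
// whose chunks are merged into the result.
interface Chunk { id: string; content: string }

interface Tools {
  search(query: string): Chunk[];           // → relevant chunks
  subQuestions(question: string): string[]; // stand-in for the LLM's decomposition
}

function recursiveQuery(tools: Tools, question: string, depth = 2): Chunk[] {
  const found = tools.search(question);
  if (depth === 0) return found; // depth limit bounds total LLM calls
  for (const sub of tools.subQuestions(question)) {
    // Mirrors recursive_query(sub_question): each level contributes
    // its own chunks plus those of its sub-questions.
    found.push(...recursiveQuery(tools, sub, depth - 1));
  }
  return found;
}
```

With mock tools that decompose "Where are edges calculated?" into one sub-question per edge-related class, the driver gathers the top-level chunks and every sub-question's chunks in a single flat list for synthesis.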
## Storage (eXist-db)

### Why eXist-db?

- **Versioning**: track codebase changes over time
- **XML-native**: fits the xml-pipeline philosophy
- **XQuery**: powerful querying for structured data
- **Efficient**: handles millions of documents

### Schema

```xml
<library>
  <name>OpenCASCADE</name>
  <source>git:https://github.com/Open-Cascade-SAS/OCCT.git</source>
  <indexed_at>2026-01-26T08:00:00Z</indexed_at>
  <stats files="12847" lines="2341892" chunks="89234"/>
  <chunk id="...">
    <content>...</content>
    <calls>
      <call>GCPnts_TangentialDeflection</call>
    </calls>
    <imports>
      <import>gp_Pnt</import>
    </imports>
  </chunk>
</library>
```

## Pricing (Premium Tier)

| Operation | Cost |
|-----------|------|
| Ingest (per 1M tokens) | $2.00 |
| Index (per library) | $5.00 - $50.00 (depends on size) |
| Query (per query) | $0.10 - $2.00 (depends on complexity) |
| Storage (per GB/month) | $0.50 |

**Why premium:**

- Compute-intensive (many LLM calls)
- Storage-intensive (versioned codebases)
- High value (saves weeks of manual exploration)

## Use Cases

### 1. Legacy Code Understanding

"I inherited a 500K-line Fortran codebase. Help me understand the data flow."

### 2. API Discovery

"How do I create a NURBS surface with specific knot vectors?"

### 3. Impact Analysis

"What would break if I deprecated this function?"

### 4. Onboarding

"Explain the architecture of this codebase to a new developer."

### 5. Code Review Assistance

"Does this change follow the patterns used elsewhere in the codebase?"

## Implementation Phases

### Phase 1: MVP

- [ ] Basic ingest (git clone, tarball upload)
- [ ] Code chunker (Python, JavaScript, C++)
- [ ] eXist-db storage
- [ ] Simple RLM query (search + retrieve)

### Phase 2: Full Chunkers

- [ ] Prose chunker
- [ ] Structured chunker (YAML/JSON/XML)
- [ ] Tabular chunker
- [ ] WASM chunker SDK

### Phase 3: Deep Indexing

- [ ] Call graph extraction
- [ ] Type hierarchy
- [ ] Cross-reference index
- [ ] Incremental re-indexing on changes

### Phase 4: Advanced Queries

- [ ] Multi-turn conversations about code
- [ ] "What if" analysis
- [ ] Code generation informed by codebase patterns

---

*"Finally understand the codebase you inherited."*
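As a back-of-envelope check on the pricing table, here is a minimal cost estimator. The per-token and storage rates mirror the table; the index and per-query amounts are caller-supplied because the spec only gives ranges, and the example figures below are illustrative assumptions, not published prices.

```typescript
// Rough monthly-cost estimator for the pricing table above.
const INGEST_PER_1M_TOKENS = 2.0; // $2.00 per 1M tokens ingested
const STORAGE_PER_GB_MONTH = 0.5; // $0.50 per GB/month

function estimateMonthlyCost(opts: {
  tokens: number;       // tokens ingested this month
  indexCost: number;    // $5.00-$50.00, depends on library size
  queries: number;      // queries run this month
  costPerQuery: number; // $0.10-$2.00, depends on complexity
  storageGb: number;    // stored, versioned data
}): number {
  const ingest = (opts.tokens / 1_000_000) * INGEST_PER_1M_TOKENS;
  const storage = opts.storageGb * STORAGE_PER_GB_MONTH;
  return ingest + opts.indexCost + opts.queries * opts.costPerQuery + storage;
}
```

For example, ingesting a 2.3M-token library with an assumed $20 index, 100 queries averaging $0.34 each, and 1 GB of storage works out to roughly $59 for the month.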