
Premium Librarian — RLM-Powered Codebase Intelligence

Status: Design Spec
Author: Dan & Donna
Date: 2026-01-26

Overview

The Premium Librarian is a tool powered by a Recursive Language Model (RLM) that can ingest entire codebases or document corpora — millions of tokens — and answer natural language queries about them.

Unlike traditional code search (grep, ripgrep, Sourcegraph), the Premium Librarian understands structure, relationships, and intent. It can answer questions like:

  • "Where are edges calculated in OpenCASCADE?"
  • "How does the authentication flow work?"
  • "What would break if I changed this interface?"

Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                        PREMIUM LIBRARIAN                            │
│                                                                     │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────────┐ │
│  │   Ingest    │───▶│   Chunker   │───▶│      eXist-db           │ │
│  │   (upload)  │    │  (4 types   │    │  (versioned storage)    │ │
│  │             │    │   + WASM)   │    │                         │ │
│  └─────────────┘    └─────────────┘    └───────────┬─────────────┘ │
│                                                     │               │
│                                                     ▼               │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────────┐ │
│  │   Query     │───▶│    RLM      │───▶│     Index / Map         │ │
│  │  (natural   │    │  Processor  │    │  (structure, relations) │ │
│  │   language) │    │             │    │                         │ │
│  └─────────────┘    └──────┬──────┘    └─────────────────────────┘ │
│                            │                                        │
│                            ▼                                        │
│                    ┌─────────────┐                                  │
│                    │  Response   │                                  │
│                    │  (answer +  │                                  │
│                    │   sources)  │                                  │
│                    └─────────────┘                                  │
└─────────────────────────────────────────────────────────────────────┘

Chunkers

The RLM algorithm needs content-aware chunking. Different content types have different natural boundaries.

Built-in Chunkers

| Chunker    | Content Types                                   | Chunking Strategy                                                        |
|------------|-------------------------------------------------|--------------------------------------------------------------------------|
| Code       | .py, .js, .ts, .cpp, .c, .java, .go, .rs, etc.  | Functions, classes, modules. Preserves imports, docstrings, signatures.   |
| Prose      | .md, .txt, .rst, .adoc                          | Paragraphs, sections, chapters. Preserves headings, structure.            |
| Structured | .yaml, .json, .xml, .toml, .ini                 | Schema-aware. Preserves hierarchy, keys, nesting.                         |
| Tabular    | .csv, .tsv, .parquet                            | Row groups with headers. Preserves column semantics.                      |
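As a sketch of the Code chunker's strategy, the snippet below splits a Python file at top-level `def` boundaries. This is a deliberately simple line-based heuristic for illustration only; a production chunker would use a real parser (e.g. tree-sitter), and the `Chunk` shape here is illustrative, not the actual SDK type.

```typescript
// Illustrative Chunk shape (mirrors the fields named later in this
// spec: content, startLine, endLine, type, name).
interface Chunk {
  content: string;
  startLine: number;
  endLine: number;
  type: string;
  name: string;
}

// Split a Python source string at top-level `def` boundaries.
function chunkPythonSource(source: string): Chunk[] {
  const lines = source.split("\n");
  const chunks: Chunk[] = [];
  let start = 0;
  let name = "<module>"; // everything before the first def is module-level

  const flush = (end: number) => {
    if (end > start) {
      chunks.push({
        content: lines.slice(start, end).join("\n"),
        startLine: start + 1,
        endLine: end,
        type: name === "<module>" ? "module" : "function",
        name,
      });
    }
  };

  lines.forEach((line, i) => {
    const m = line.match(/^def\s+(\w+)/); // top-level defs only
    if (m) {
      flush(i); // emit everything accumulated before this def
      start = i;
      name = m[1];
    }
  });
  flush(lines.length);
  return chunks;
}
```

The same skeleton generalizes to the other built-in chunkers by swapping the boundary test: headings for Prose, keys for Structured, row groups for Tabular.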

Content Type Detection

  1. File extension mapping (fast path)
  2. MIME type detection
  3. Content sniffing (magic bytes, heuristics)
  4. User override via config
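The detection cascade above can be sketched as follows. The extension table is a small illustrative subset, the MIME lookup (step 2) is omitted for brevity, and the magic-byte heuristic is a stand-in for real content sniffing; a user override is assumed to win over everything.

```typescript
// Illustrative extension -> content-type fast path (step 1).
const EXTENSION_MAP: Record<string, string> = {
  py: "code", ts: "code", cpp: "code",
  md: "prose", txt: "prose",
  json: "structured", yaml: "structured",
  csv: "tabular",
};

function detectContentType(
  path: string,
  head: Uint8Array,      // first bytes of the file, for sniffing
  override?: string,     // step 4: user override via config
): string {
  if (override) return override; // an explicit override wins outright

  // Step 1: extension fast path.
  const ext = path.split(".").pop() ?? "";
  if (EXTENSION_MAP[ext]) return EXTENSION_MAP[ext];

  // Step 2 (MIME lookup) omitted here for brevity.
  // Step 3: content sniffing — e.g. a leading "{" or "<" suggests
  // structured data. Real sniffing would check many more signatures.
  if (head.length > 0 && (head[0] === 0x7b || head[0] === 0x3c)) {
    return "structured";
  }
  return "prose"; // fallback default
}
```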

Custom WASM Chunkers

For specialized formats, users can provide their own chunker:

// chunker.ts (AssemblyScript)
import { Chunk, ChunkMetadata } from "@openblox/chunker-sdk";

export function chunk(content: string, metadata: ChunkMetadata): Chunk[] {
  // Custom logic for proprietary format
  const chunks: Chunk[] = [];
  
  // Parse your format, emit chunks
  // Each chunk has: content, startLine, endLine, type, name
  
  return chunks;
}

Compile & upload:

asc chunker.ts -o chunker.wasm
# Upload via API or CLI

Security:

  • WASM runs sandboxed (can't escape, can't access filesystem)
  • CPU/memory limits enforced
  • The chunker is a pure function: content in, chunks out

Ingest Pipeline

1. Upload

POST /api/v1/libraries
{
  "name": "opencascade",
  "source": {
    "type": "git",
    "url": "https://github.com/Open-Cascade-SAS/OCCT.git",
    "branch": "master"
  },
  // OR
  "source": {
    "type": "upload",
    "archive": "<base64 tarball>"
  }
}

2. Chunking

For each file:

  1. Detect content type
  2. Select chunker (built-in or custom WASM)
  3. Chunk content
  4. Store chunks in eXist-db with metadata

Chunk metadata:

{
  "id": "chunk-uuid",
  "library_id": "lib-uuid",
  "file_path": "src/BRepBuilderAPI/BRepBuilderAPI_MakeEdge.cxx",
  "start_line": 142,
  "end_line": 387,
  "type": "function",
  "name": "BRepBuilderAPI_MakeEdge::Build",
  "language": "cpp",
  "imports": ["gp_Pnt", "TopoDS_Edge", "BRepLib"],
  "calls": ["GCPnts_TangentialDeflection", "BRep_Builder"],
  "version": "v7.8.0",
  "indexed_at": "2026-01-26T..."
}
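The metadata record above can be expressed as a TypeScript interface, with a small sanity check on line ranges. Field names mirror the JSON example; this is a sketch, not the service's published schema.

```typescript
// Chunk metadata record, mirroring the JSON example above.
interface ChunkRecord {
  id: string;
  library_id: string;
  file_path: string;
  start_line: number;
  end_line: number;
  type: "function" | "class" | "module" | string;
  name: string;
  language: string;
  imports: string[];  // symbols this chunk imports
  calls: string[];    // symbols this chunk calls (feeds the call graph)
  version: string;    // codebase version the chunk was indexed from
  indexed_at: string; // ISO 8601 timestamp
}

// Minimal validity check: non-empty id and a sane line range.
function isValidChunkRecord(r: ChunkRecord): boolean {
  return r.id.length > 0 && r.start_line >= 1 && r.end_line >= r.start_line;
}
```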

3. Indexing (Background)

After chunking, the RLM builds a structural index:

  • Call graph: What calls what
  • Type hierarchy: Classes, inheritance
  • Module map: How code is organized
  • Symbol table: Functions, classes, constants
  • Dependency graph: Imports, includes

This runs as a background job (can take hours for large codebases).
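As a sketch of how the call graph could be derived from per-chunk `calls` metadata: map each function to the set of functions it calls, then invert the graph (callee to callers) to support impact analysis questions like "what would break if I changed this?". This is illustrative, not the actual indexer; the function names in the usage below are hypothetical.

```typescript
type CallGraph = Map<string, Set<string>>;

// Aggregate per-chunk "calls" lists into a caller -> callees graph.
function buildCallGraph(
  chunks: { name: string; calls: string[] }[],
): CallGraph {
  const graph: CallGraph = new Map();
  for (const c of chunks) {
    const callees = graph.get(c.name) ?? new Set<string>();
    c.calls.forEach((callee) => callees.add(callee));
    graph.set(c.name, callees);
  }
  return graph;
}

// Invert to callee -> callers: the core lookup for impact analysis.
function invertCallGraph(graph: CallGraph): CallGraph {
  const inverted: CallGraph = new Map();
  graph.forEach((callees, caller) => {
    callees.forEach((callee) => {
      const callers = inverted.get(callee) ?? new Set<string>();
      callers.add(caller);
      inverted.set(callee, callers);
    });
  });
  return inverted;
}
```

The other index structures (type hierarchy, module map, dependency graph) follow the same pattern over different metadata fields.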

Query Pipeline

1. Query

POST /api/v1/libraries/{id}/query
{
  "question": "Where are edges calculated?",
  "max_tokens": 8000,
  "include_sources": true
}

2. RLM Processing

The RLM receives:

You have access to a library "opencascade" with 2.3M lines of C++ code.

Structural index available:
- 4,231 classes
- 47,892 functions  
- 12 main modules: BRepBuilderAPI, BRepAlgoAPI, ...

User question: "Where are edges calculated?"

Available tools:
- search(query) → relevant chunks
- get_chunk(id) → full chunk content
- get_structure(path) → module/class structure
- recursive_query(sub_question) → ask yourself about a subset

The RLM then:

  1. Searches for "edge" in symbol table
  2. Finds BRepBuilderAPI_MakeEdge, BRepAlgo_EdgeConnector, etc.
  3. Recursively queries: "What does BRepBuilderAPI_MakeEdge do?"
  4. Retrieves relevant chunks, synthesizes answer
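The loop above can be sketched as a recursive function over the listed tools. The `Tools` interface and all behavior below are stand-ins: in the real system, `subQuestions` and `synthesize` are LLM calls, and the depth cap is one way (among others, such as token budgets) to guarantee the recursion terminates.

```typescript
// Stand-ins for the search / get_chunk / recursive_query tools
// listed in the prompt above.
interface Tools {
  search(query: string): string[];          // relevant chunk ids
  getChunk(id: string): string;             // full chunk content
  subQuestions(question: string): string[]; // model-proposed follow-ups
  synthesize(question: string, evidence: string[]): string;
}

// Recursive query: gather evidence, optionally spawn sub-questions,
// then synthesize an answer. `depth` caps the recursion.
function recursiveQuery(q: string, tools: Tools, depth = 2): string {
  const evidence = tools.search(q).map((id) => tools.getChunk(id));
  if (depth > 0) {
    for (const sub of tools.subQuestions(q)) {
      evidence.push(recursiveQuery(sub, tools, depth - 1));
    }
  }
  return tools.synthesize(q, evidence);
}
```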

3. Response

{
  "answer": "Edge calculation in OpenCASCADE primarily happens in the BRepBuilderAPI module...",
  
  "sources": [
    {
      "file": "src/BRepBuilderAPI/BRepBuilderAPI_MakeEdge.cxx",
      "lines": "142-387",
      "relevance": 0.95,
      "snippet": "void BRepBuilderAPI_MakeEdge::Build() { ... }"
    },
    {
      "file": "src/BRepAlgo/BRepAlgo_EdgeConnector.cxx", 
      "lines": "89-201",
      "relevance": 0.82,
      "snippet": "..."
    }
  ],
  
  "tokens_used": 47832,
  "chunks_examined": 127,
  "cost": "$0.34"
}

Storage (eXist-db)

Why eXist-db?

  • Versioning: Track codebase changes over time
  • XML-native: Fits with xml-pipeline philosophy
  • XQuery: Powerful querying for structured data
  • Efficient: Handles millions of documents

Schema

<library id="opencascade" version="v7.8.0">
  <metadata>
    <name>OpenCASCADE</name>
    <source>git:https://github.com/Open-Cascade-SAS/OCCT.git</source>
    <indexed_at>2026-01-26T08:00:00Z</indexed_at>
    <stats>
      <files>12847</files>
      <lines>2341892</lines>
      <chunks>89234</chunks>
    </stats>
  </metadata>
  
  <structure>
    <module name="BRepBuilderAPI">
      <class name="BRepBuilderAPI_MakeEdge">
        <function name="Build" file="..." lines="142-387"/>
        ...
      </class>
    </module>
  </structure>
  
  <chunks>
    <chunk id="..." file="..." type="function" name="Build">
      <content>...</content>
      <relations>
        <calls>GCPnts_TangentialDeflection</calls>
        <imports>gp_Pnt</imports>
      </relations>
    </chunk>
  </chunks>
</library>

Pricing (Premium Tier)

| Operation              | Cost                                      |
|------------------------|-------------------------------------------|
| Ingest (per 1M tokens) | $2.00                                     |
| Index (per library)    | $5.00 - $50.00 (depends on size)          |
| Query (per query)      | $0.10 - $2.00 (depends on complexity)     |
| Storage (per GB/month) | $0.50                                     |

Why premium:

  • Compute-intensive (lots of LLM calls)
  • Storage-intensive (versioned codebases)
  • High value (saves weeks of manual exploration)
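A back-of-envelope estimator for the rates in the pricing table. The ingest and index costs are one-time and folded into the first month here; all input values in the usage below (token count, query volume, per-query cost) are assumptions for illustration, not measured figures.

```typescript
// Rough first-month cost from the pricing table above.
function estimateFirstMonthCost(opts: {
  ingestTokensMillions: number; // one-time ingest, $2.00 per 1M tokens
  indexCost: number;            // one-time, $5.00 - $50.00 by size
  queries: number;
  avgQueryCost: number;         // $0.10 - $2.00 by complexity
  storageGb: number;            // $0.50 per GB/month
}): number {
  const ingest = opts.ingestTokensMillions * 2.0;
  const storage = opts.storageGb * 0.5;
  const queryTotal = opts.queries * opts.avgQueryCost;
  return ingest + opts.indexCost + queryTotal + storage;
}
```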

Use Cases

1. Legacy Code Understanding

"I inherited a 500K line Fortran codebase. Help me understand the data flow."

2. API Discovery

"How do I create a NURBS surface with specific knot vectors?"

3. Impact Analysis

"What would break if I deprecated this function?"

4. Onboarding

"Explain the architecture of this codebase to a new developer."

5. Code Review Assistance

"Does this change follow the patterns used elsewhere in the codebase?"

Implementation Phases

Phase 1: MVP

  • Basic ingest (git clone, tarball upload)
  • Code chunker (Python, JavaScript, C++)
  • eXist-db storage
  • Simple RLM query (search + retrieve)

Phase 2: Full Chunkers

  • Prose chunker
  • Structured chunker (YAML/JSON/XML)
  • Tabular chunker
  • WASM chunker SDK

Phase 3: Deep Indexing

  • Call graph extraction
  • Type hierarchy
  • Cross-reference index
  • Incremental re-indexing on changes

Phase 4: Advanced Queries

  • Multi-turn conversations about code
  • "What if" analysis
  • Code generation informed by codebase patterns

"Finally understand the codebase you inherited."