diff --git a/docs/premium-librarian-spec.md b/docs/premium-librarian-spec.md
new file mode 100644
index 0000000..2745789
--- /dev/null
+++ b/docs/premium-librarian-spec.md
@@ -0,0 +1,323 @@
+# Premium Librarian — RLM-Powered Codebase Intelligence
+
+**Status:** Design Spec
+**Author:** Dan & Donna
+**Date:** 2026-01-26
+
+## Overview
+
+The Premium Librarian is an RLM-powered (Recursive Language Model) tool that can ingest entire codebases or document corpora — millions of tokens — and answer natural language queries about them.
+
+Unlike traditional code search (grep, ripgrep, Sourcegraph), the Premium Librarian *understands* structure, relationships, and intent. It can answer questions like:
+
+- "Where are edges calculated in OpenCASCADE?"
+- "How does the authentication flow work?"
+- "What would break if I changed this interface?"
+
+## Architecture
+
+```
+┌─────────────────────────────────────────────────────────────────────┐
+│                          PREMIUM LIBRARIAN                          │
+│                                                                     │
+│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────────┐  │
+│  │   Ingest    │───▶│   Chunker   │───▶│       eXist-db          │  │
+│  │  (upload)   │    │  (4 types   │    │   (versioned storage)   │  │
+│  │             │    │   + WASM)   │    │                         │  │
+│  └─────────────┘    └─────────────┘    └───────────┬─────────────┘  │
+│                                                    │                │
+│                                                    ▼                │
+│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────────┐  │
+│  │   Query     │───▶│    RLM      │───▶│      Index / Map        │  │
+│  │  (natural   │    │  Processor  │    │ (structure, relations)  │  │
+│  │  language)  │    │             │    │                         │  │
+│  └─────────────┘    └──────┬──────┘    └─────────────────────────┘  │
+│                            │                                        │
+│                            ▼                                        │
+│                     ┌─────────────┐                                 │
+│                     │  Response   │                                 │
+│                     │  (answer +  │                                 │
+│                     │  sources)   │                                 │
+│                     └─────────────┘                                 │
+└─────────────────────────────────────────────────────────────────────┘
+```
+
+## Chunkers
+
+The RLM algorithm needs content-aware chunking. Different content types have different natural boundaries.
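
As a rough illustration of what "different natural boundaries" could mean in practice, the sketch below splits prose at Markdown headings and tabular data into row groups that repeat the header. It is only a sketch under stated assumptions: the `Chunk` shape, the function names, and the extension checks are hypothetical and not taken from the spec or from any SDK.

```typescript
// Hypothetical sketch: the Chunk shape and every function name here are
// illustrative assumptions, not part of the Premium Librarian spec.
interface Chunk {
  content: string;
  startLine: number; // 1-based, inclusive
  endLine: number;   // 1-based, inclusive
  type: string;
}

// Prose boundary: a new chunk starts at every Markdown heading.
function chunkProse(text: string): Chunk[] {
  const lines = text.split("\n");
  const chunks: Chunk[] = [];
  let start = 0;
  for (let i = 1; i <= lines.length; i++) {
    const atBoundary = i === lines.length || /^#{1,6}\s/.test(lines[i]);
    if (atBoundary) {
      chunks.push({
        content: lines.slice(start, i).join("\n"),
        startLine: start + 1,
        endLine: i,
        type: "section",
      });
      start = i;
    }
  }
  return chunks;
}

// Tabular boundary: fixed-size row groups, each repeating the header row.
// Assumes the first line is the header and there are no blank lines inside.
function chunkTabular(text: string, rowsPerChunk: number): Chunk[] {
  const [header, ...rows] = text.trimEnd().split("\n");
  const chunks: Chunk[] = [];
  for (let i = 0; i < rows.length; i += rowsPerChunk) {
    const group = rows.slice(i, i + rowsPerChunk);
    chunks.push({
      content: [header, ...group].join("\n"),
      startLine: i + 2, // +1 for 1-based numbering, +1 to skip the header
      endLine: i + 1 + group.length,
      type: "row_group",
    });
  }
  return chunks;
}

// Extension-based dispatch; anything unrecognised becomes one file-level chunk.
function chunkByExtension(path: string, text: string): Chunk[] {
  const ext = path.slice(path.lastIndexOf(".")).toLowerCase();
  if (ext === ".md" || ext === ".txt") return chunkProse(text);
  if (ext === ".csv") return chunkTabular(text, 2);
  return [
    { content: text, startLine: 1, endLine: text.split("\n").length, type: "file" },
  ];
}
```

Repeating the header in each row group is one way to keep column meaning attached to every chunk; a real chunker would also need to handle quoted newlines in CSV and headings inside fenced code.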
+
+### Built-in Chunkers
+
+| Chunker | Content Types | Chunking Strategy |
+|---------|---------------|-------------------|
+| **Code** | .py, .js, .ts, .cpp, .c, .java, .go, .rs, etc. | Functions, classes, modules. Preserves imports, docstrings, signatures. |
+| **Prose** | .md, .txt, .rst, .adoc | Paragraphs, sections, chapters. Preserves headings, structure. |
+| **Structured** | .yaml, .json, .xml, .toml, .ini | Schema-aware. Preserves hierarchy, keys, nesting. |
+| **Tabular** | .csv, .tsv, .parquet | Row groups with headers. Preserves column semantics. |
+
+### Content Type Detection
+
+1. File extension mapping (fast path)
+2. MIME type detection
+3. Content sniffing (magic bytes, heuristics)
+4. User override via config
+
+### Custom WASM Chunkers
+
+For specialized formats, users can provide their own chunker:
+
+```typescript
+// chunker.ts (AssemblyScript)
+import { Chunk, ChunkMetadata } from "@openblox/chunker-sdk";
+
+export function chunk(content: string, metadata: ChunkMetadata): Chunk[] {
+  // Custom logic for proprietary format
+  const chunks: Chunk[] = [];
+
+  // Parse your format, emit chunks
+  // Each chunk has: content, startLine, endLine, type, name
+
+  return chunks;
+}
+```
+
+**Compile & upload:**
+```bash
+asc chunker.ts -o chunker.wasm
+# Upload via API or CLI
+```
+
+**Security:**
+- WASM runs sandboxed (can't escape, can't access filesystem)
+- CPU/memory limits enforced
+- Chunker is pure function: string in, chunks out
+
+## Ingest Pipeline
+
+### 1. Upload
+
+```
+POST /api/v1/libraries
+{
+  "name": "opencascade",
+  "source": {
+    "type": "git",
+    "url": "https://github.com/Open-Cascade-SAS/OCCT.git",
+    "branch": "master"
+  },
+  // OR
+  "source": {
+    "type": "upload",
+    "archive": ""
+  }
+}
+```
+
+### 2. Chunking
+
+For each file:
+1. Detect content type
+2. Select chunker (built-in or custom WASM)
+3. Chunk content
+4. Store chunks in eXist-db with metadata
+
+**Chunk metadata:**
+```json
+{
+  "id": "chunk-uuid",
+  "library_id": "lib-uuid",
+  "file_path": "src/BRepBuilderAPI/BRepBuilderAPI_MakeEdge.cxx",
+  "start_line": 142,
+  "end_line": 387,
+  "type": "function",
+  "name": "BRepBuilderAPI_MakeEdge::Build",
+  "language": "cpp",
+  "imports": ["gp_Pnt", "TopoDS_Edge", "BRepLib"],
+  "calls": ["GCPnts_TangentialDeflection", "BRep_Builder"],
+  "version": "v7.8.0",
+  "indexed_at": "2026-01-26T..."
+}
+```
+
+### 3. Indexing (Background)
+
+After chunking, RLM builds structural index:
+
+- **Call graph**: What calls what
+- **Type hierarchy**: Classes, inheritance
+- **Module map**: How code is organized
+- **Symbol table**: Functions, classes, constants
+- **Dependency graph**: Imports, includes
+
+This runs as a background job (can take hours for large codebases).
+
+## Query Pipeline
+
+### 1. Query
+
+```
+POST /api/v1/libraries/{id}/query
+{
+  "question": "Where are edges calculated?",
+  "max_tokens": 8000,
+  "include_sources": true
+}
+```
+
+### 2. RLM Processing
+
+The RLM receives:
+```
+You have access to a library "opencascade" with 2.3M lines of C++ code.
+
+Structural index available:
+- 4,231 classes
+- 47,892 functions
+- 12 main modules: BRepBuilderAPI, BRepAlgoAPI, ...
+
+User question: "Where are edges calculated?"
+
+Available tools:
+- search(query) → relevant chunks
+- get_chunk(id) → full chunk content
+- get_structure(path) → module/class structure
+- recursive_query(sub_question) → ask yourself about a subset
+```
+
+RLM then:
+1. Searches for "edge" in symbol table
+2. Finds `BRepBuilderAPI_MakeEdge`, `BRepAlgo_EdgeConnector`, etc.
+3. Recursively queries: "What does BRepBuilderAPI_MakeEdge do?"
+4. Retrieves relevant chunks, synthesizes answer
+
+### 3. Response
+
+```json
+{
+  "answer": "Edge calculation in OpenCASCADE primarily happens in the BRepBuilderAPI module...",
+
+  "sources": [
+    {
+      "file": "src/BRepBuilderAPI/BRepBuilderAPI_MakeEdge.cxx",
+      "lines": "142-387",
+      "relevance": 0.95,
+      "snippet": "void BRepBuilderAPI_MakeEdge::Build() { ... }"
+    },
+    {
+      "file": "src/BRepAlgo/BRepAlgo_EdgeConnector.cxx",
+      "lines": "89-201",
+      "relevance": 0.82,
+      "snippet": "..."
+    }
+  ],
+
+  "tokens_used": 47832,
+  "chunks_examined": 127,
+  "cost": "$0.34"
+}
+```
+
+## Storage (eXist-db)
+
+### Why eXist-db?
+
+- **Versioning**: Track codebase changes over time
+- **XML-native**: Fits with xml-pipeline philosophy
+- **XQuery**: Powerful querying for structured data
+- **Efficient**: Handles millions of documents
+
+### Schema
+
+```xml
+
+
+    OpenCASCADE
+    git:https://github.com/Open-Cascade-SAS/OCCT.git
+    2026-01-26T08:00:00Z
+
+      12847
+      2341892
+      89234
+
+
+
+
+
+
+
+    ...
+
+
+
+
+
+
+    ...
+
+      GCPnts_TangentialDeflection
+      gp_Pnt
+
+
+
+
+```
+
+## Pricing (Premium Tier)
+
+| Operation | Cost |
+|-----------|------|
+| Ingest (per 1M tokens) | $2.00 |
+| Index (per library) | $5.00 - $50.00 (depends on size) |
+| Query (per query) | $0.10 - $2.00 (depends on complexity) |
+| Storage (per GB/month) | $0.50 |
+
+**Why premium:**
+- Compute-intensive (lots of LLM calls)
+- Storage-intensive (versioned codebases)
+- High value (saves weeks of manual exploration)
+
+## Use Cases
+
+### 1. Legacy Code Understanding
+"I inherited a 500K line Fortran codebase. Help me understand the data flow."
+
+### 2. API Discovery
+"How do I create a NURBS surface with specific knot vectors?"
+
+### 3. Impact Analysis
+"What would break if I deprecated this function?"
+
+### 4. Onboarding
+"Explain the architecture of this codebase to a new developer."
+
+### 5. Code Review Assistance
+"Does this change follow the patterns used elsewhere in the codebase?"
+
+## Implementation Phases
+
+### Phase 1: MVP
+- [ ] Basic ingest (git clone, tarball upload)
+- [ ] Code chunker (Python, JavaScript, C++)
+- [ ] eXist-db storage
+- [ ] Simple RLM query (search + retrieve)
+
+### Phase 2: Full Chunkers
+- [ ] Prose chunker
+- [ ] Structured chunker (YAML/JSON/XML)
+- [ ] Tabular chunker
+- [ ] WASM chunker SDK
+
+### Phase 3: Deep Indexing
+- [ ] Call graph extraction
+- [ ] Type hierarchy
+- [ ] Cross-reference index
+- [ ] Incremental re-indexing on changes
+
+### Phase 4: Advanced Queries
+- [ ] Multi-turn conversations about code
+- [ ] "What if" analysis
+- [ ] Code generation informed by codebase patterns
+
+---
+
+*"Finally understand the codebase you inherited."*