Document Premium Librarian in CLAUDE.md
Add comprehensive documentation for the codebase intelligence system: - Architecture diagram showing ingest → chunk → store → query flow - Component table with module locations and purposes - Usage examples for ingest_git_repo and query_library - Supported languages table (Python, JS/TS, C++, etc.) - Message payloads reference - Requirements and installation Also updates: - Project structure to include librarian/ package - Key modules table with PremiumLibrarian entry - Optional dependencies with [librarian] extra Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This commit is contained in:
parent
ce8a9ae0e7
commit
d6a0efc064
1 changed files with 83 additions and 0 deletions
83
CLAUDE.md
83
CLAUDE.md
|
|
@ -77,6 +77,7 @@ xml-pipeline/
|
||||||
│ ├── prompts/ # System prompts (no_paperclippers, etc.)
|
│ ├── prompts/ # System prompts (no_paperclippers, etc.)
|
||||||
│ ├── schema/ # XSD schema files
|
│ ├── schema/ # XSD schema files
|
||||||
│ ├── tools/ # Native tools (files, shell, search, etc.)
|
│ ├── tools/ # Native tools (files, shell, search, etc.)
|
||||||
|
│ ├── librarian/ # Premium Librarian (codebase intelligence)
|
||||||
│ └── utils/ # Shared utilities
|
│ └── utils/ # Shared utilities
|
||||||
├── config/ # Example organism configurations
|
├── config/ # Example organism configurations
|
||||||
├── docs/ # Architecture and design docs
|
├── docs/ # Architecture and design docs
|
||||||
|
|
@ -131,6 +132,7 @@ AgentServer implements a stream-based message pump where all communication flows
|
||||||
| UsageTracker | `xml_pipeline/llm/usage_tracker.py` | Production billing and gas usage metering |
|
| UsageTracker | `xml_pipeline/llm/usage_tracker.py` | Production billing and gas usage metering |
|
||||||
| PromptRegistry | `xml_pipeline/platform/prompt_registry.py` | Immutable system prompt storage |
|
| PromptRegistry | `xml_pipeline/platform/prompt_registry.py` | Immutable system prompt storage |
|
||||||
| ContextBuffer | `xml_pipeline/memory/context_buffer.py` | Conversation history per thread |
|
| ContextBuffer | `xml_pipeline/memory/context_buffer.py` | Conversation history per thread |
|
||||||
|
| PremiumLibrarian | `xml_pipeline/librarian/` | Codebase intelligence with RAG |
|
||||||
|
|
||||||
## Development Guidelines
|
## Development Guidelines
|
||||||
|
|
||||||
|
|
@ -362,6 +364,7 @@ pip install xml-pipeline[openai] # OpenAI SDK
|
||||||
# Tool backends
|
# Tool backends
|
||||||
pip install xml-pipeline[redis] # Distributed key-value store
|
pip install xml-pipeline[redis] # Distributed key-value store
|
||||||
pip install xml-pipeline[search] # DuckDuckGo search
|
pip install xml-pipeline[search] # DuckDuckGo search
|
||||||
|
pip install xml-pipeline[librarian] # Codebase intelligence (GitPython)
|
||||||
|
|
||||||
# Console example
|
# Console example
|
||||||
pip install xml-pipeline[console] # prompt_toolkit for examples
|
pip install xml-pipeline[console] # prompt_toolkit for examples
|
||||||
|
|
@ -402,6 +405,86 @@ Built-in message types in `xml_pipeline/primitives/`:
|
||||||
| `TextInput` | User text input from console |
|
| `TextInput` | User text input from console |
|
||||||
| `TextOutput` | Text output to console |
|
| `TextOutput` | Text output to console |
|
||||||
|
|
||||||
|
## Premium Librarian (Codebase Intelligence)
|
||||||
|
|
||||||
|
The Premium Librarian provides RLM-powered codebase intelligence using RAG (Retrieval-Augmented Generation). It ingests codebases, chunks them intelligently, stores in eXist-db, and answers natural language queries.
|
||||||
|
|
||||||
|
### Architecture
|
||||||
|
|
||||||
|
```
|
||||||
|
┌─────────────┐ ┌─────────────┐ ┌─────────────────┐
|
||||||
|
│ Ingest │───▶│ Chunker │───▶│ eXist-db │
|
||||||
|
│ (git/tar) │ │ (AST-based)│ │ (XML storage) │
|
||||||
|
└─────────────┘ └─────────────┘ └────────┬────────┘
|
||||||
|
│
|
||||||
|
┌─────────────┐ ┌─────────────┐ ┌────────▼────────┐
|
||||||
|
│ Query │───▶│ RAG Search │───▶│ Online LLM │
|
||||||
|
│ (natural │ │ (XQuery + │ │ (synthesis) │
|
||||||
|
│ language) │ │ Lucene) │ └─────────────────┘
|
||||||
|
└─────────────┘ └─────────────┘
|
||||||
|
```
|
||||||
|
|
||||||
|
### Components
|
||||||
|
|
||||||
|
| Module | Location | Purpose |
|
||||||
|
|--------|----------|---------|
|
||||||
|
| Chunker | `xml_pipeline/librarian/chunker.py` | AST-based code chunking (Python, JS, C++) |
|
||||||
|
| Ingest | `xml_pipeline/librarian/ingest.py` | Git clone + file walking + storage |
|
||||||
|
| Index | `xml_pipeline/librarian/index.py` | Structural index (files, functions, classes) |
|
||||||
|
| Query | `xml_pipeline/librarian/query.py` | RAG search + LLM synthesis |
|
||||||
|
| Primitives | `xml_pipeline/librarian/primitives.py` | XML payloads for message bus |
|
||||||
|
| Handler | `xml_pipeline/librarian/handler.py` | Message handlers |
|
||||||
|
|
||||||
|
### Usage
|
||||||
|
|
||||||
|
```python
|
||||||
|
from xml_pipeline.librarian import ingest_git_repo, query_library
|
||||||
|
|
||||||
|
# Ingest a codebase
|
||||||
|
result = await ingest_git_repo(
|
||||||
|
url="https://github.com/example/repo.git",
|
||||||
|
branch="main",
|
||||||
|
library_name="my-lib",
|
||||||
|
)
|
||||||
|
print(f"Ingested {result.files_processed} files, {result.chunks_created} chunks")
|
||||||
|
|
||||||
|
# Query the library
|
||||||
|
answer = await query_library(
|
||||||
|
library_id=result.library_id,
|
||||||
|
question="How does the authentication system work?",
|
||||||
|
)
|
||||||
|
print(answer.answer)
|
||||||
|
for source in answer.sources:
|
||||||
|
print(f" - {source.file_path}:{source.start_line}-{source.end_line}")
|
||||||
|
```
|
||||||
|
|
||||||
|
### Supported Languages
|
||||||
|
|
||||||
|
| Language | Chunking Method | Features |
|
||||||
|
|----------|-----------------|----------|
|
||||||
|
| Python | AST-based | Functions, classes, methods, imports, docstrings |
|
||||||
|
| JavaScript/TypeScript | Regex-based | Functions, arrow functions, classes, JSDoc |
|
||||||
|
| C/C++ | Regex-based | Functions, classes, structs, Doxygen |
|
||||||
|
| Markdown/RST | Heading-based | Sections by headings |
|
||||||
|
| Other | Line-based | Generic chunking with smart breaks |
|
||||||
|
|
||||||
|
### Message Payloads
|
||||||
|
|
||||||
|
| Payload | Purpose |
|
||||||
|
|---------|---------|
|
||||||
|
| `LibrarianIngest` | Request to ingest a git repository |
|
||||||
|
| `LibrarianIngested` | Response with library_id and stats |
|
||||||
|
| `LibrarianQuery` | Natural language query request |
|
||||||
|
| `LibrarianAnswer` | Answer with sources and token usage |
|
||||||
|
| `LibrarianList` | Request to list all libraries |
|
||||||
|
| `LibrarianLibraries` | Response with library list |
|
||||||
|
|
||||||
|
### Requirements
|
||||||
|
|
||||||
|
- eXist-db running for chunk storage
|
||||||
|
- LLM router configured for query synthesis
|
||||||
|
- Install with: `pip install xml-pipeline[librarian]`
|
||||||
|
|
||||||
## Additional Resources
|
## Additional Resources
|
||||||
|
|
||||||
- @docs/core-principles-v2.1.md — Single source of truth for architecture
|
- @docs/core-principles-v2.1.md — Single source of truth for architecture
|
||||||
|
|
|
||||||
Loading…
Reference in a new issue