xml-pipeline/docs/wiki/LLM-Router.md
dullfig 515c738abb Add wiki documentation for xml-pipeline.org
Comprehensive documentation set for XWiki:
- Home, Installation, Quick Start guides
- Writing Handlers and LLM Router guides
- Architecture docs (Overview, Message Pump, Thread Registry, Shared Backend)
- Reference docs (Configuration, Handler Contract, CLI)
- Hello World tutorial
- Why XML rationale
- Pandoc conversion scripts (bash + PowerShell)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-20 20:40:47 -08:00

# LLM Router
The LLM Router provides a unified interface for language model calls. Agents request a model by name; the router handles backend selection, failover, rate limiting, and retries.
## Overview
```
┌─────────────────────────────────────────────────────────┐
│                      Agent Handler                      │
│     response = await complete("grok-4.1", messages)     │
└────────────────────────────┬────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────┐
│                       LLM Router                        │
│   • Find backends serving model                         │
│   • Select backend (strategy)                           │
│   • Retry on failure                                    │
│   • Track usage per agent                               │
└─────────┬──────────────────┬──────────────────┬─────────┘
          │                  │                  │
          ▼                  ▼                  ▼
     ┌──────────┐       ┌──────────┐      ┌──────────┐
     │   XAI    │       │Anthropic │      │  Ollama  │
     │ Backend  │       │ Backend  │      │ Backend  │
     └──────────┘       └──────────┘      └──────────┘
```
## Quick Start
### Basic Usage
```python
from xml_pipeline.platform.llm_api import complete

response = await complete(
    model="grok-4.1",
    messages=[
        {"role": "system", "content": "You are helpful."},
        {"role": "user", "content": "Hello!"},
    ],
)
print(response.content)
```
### In a Handler
```python
async def my_agent(payload: Query, metadata: HandlerMetadata) -> HandlerResponse:
    from xml_pipeline.platform.llm_api import complete

    response = await complete(
        model="grok-4.1",
        messages=[
            {"role": "system", "content": metadata.usage_instructions},
            {"role": "user", "content": payload.question},
        ],
        temperature=0.7,
        max_tokens=2048,
    )
    return HandlerResponse(
        payload=Answer(text=response.content),
        to="output",
    )
```
## Configuration
### organism.yaml
```yaml
llm:
  strategy: failover        # Backend selection strategy
  retries: 3                # Max retry attempts
  retry_base_delay: 1.0     # Base delay for backoff
  retry_max_delay: 60.0     # Max delay between retries
  backends:
    - provider: xai
      api_key_env: XAI_API_KEY
      priority: 1             # Lower = preferred
      rate_limit_tpm: 100000  # Tokens per minute
      max_concurrent: 20      # Concurrent request limit
    - provider: anthropic
      api_key_env: ANTHROPIC_API_KEY
      priority: 2
    - provider: openai
      api_key_env: OPENAI_API_KEY
      priority: 3
    - provider: ollama
      base_url: http://localhost:11434
      supported_models: [llama3, mistral]
```
### Environment Variables
```env
# .env file
XAI_API_KEY=xai-abc123...
ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...
```
## Supported Providers
| Provider | Models | Auth |
|----------|--------|------|
| `xai` | grok-* | Bearer token |
| `anthropic` | claude-* | x-api-key header |
| `openai` | gpt-*, o1-*, o3-* | Bearer token |
| `ollama` | Any local model | None (local) |
### Model Routing
The router automatically selects backends based on model name:
- `grok-4.1` → XAI backend
- `claude-sonnet-4` → Anthropic backend
- `gpt-4o` → OpenAI backend
- `llama3` → Ollama (if in `supported_models`)
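The routing rules above amount to prefix matching on the model name, with Ollama's explicit `supported_models` list checked first. A hypothetical sketch (the real router's internals may differ):

```python
# Prefix → provider table; assumed for illustration.
PREFIX_TO_PROVIDER = {
    "grok-": "xai",
    "claude-": "anthropic",
    "gpt-": "openai",
    "o1-": "openai",
    "o3-": "openai",
}

def route_model(model: str, ollama_models: frozenset = frozenset()) -> str:
    """Return the provider name that should serve `model`."""
    if model in ollama_models:  # explicit supported_models wins for local models
        return "ollama"
    for prefix, provider in PREFIX_TO_PROVIDER.items():
        if model.startswith(prefix):
            return provider
    raise ValueError(f"No backend serves model {model!r}")
```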
## Strategies
### Failover (Default)
Tries backends in priority order. Falls back on error.
```yaml
llm:
  strategy: failover
  backends:
    - provider: xai
      priority: 1       # Try first
    - provider: anthropic
      priority: 2       # Fallback
```
### Round-Robin
Distributes requests evenly across backends.
```yaml
llm:
  strategy: round-robin
```
### Least-Loaded
Routes to the backend with lowest current load.
```yaml
llm:
  strategy: least-loaded
```
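One way to picture all three strategies in one place (hypothetical `Backend` objects; the field names are illustrative, not the router's actual types):

```python
import itertools
from dataclasses import dataclass

@dataclass
class Backend:
    name: str
    priority: int       # lower = preferred
    in_flight: int = 0  # current concurrent requests

def select(backends, strategy, _rr=itertools.count()):
    """Pick a backend per strategy (sketch; shared counter drives round-robin)."""
    if strategy == "failover":
        # Prefer lowest priority; on error the caller falls through to the next.
        return min(backends, key=lambda b: b.priority)
    if strategy == "round-robin":
        return backends[next(_rr) % len(backends)]
    if strategy == "least-loaded":
        return min(backends, key=lambda b: b.in_flight)
    raise ValueError(f"Unknown strategy: {strategy}")
```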
## Response Format
```python
@dataclass
class LLMResponse:
    content: str           # Generated text
    model: str             # Model used
    usage: Dict[str, int]  # Token counts
    finish_reason: str     # stop, length, tool_calls
    raw: Any               # Provider-specific response
```
### Usage Dict
```python
response.usage = {
    "prompt_tokens": 150,
    "completion_tokens": 50,
    "total_tokens": 200,
}
```
## Parameters
```python
response = await complete(
    model="grok-4.1",  # Required: model name
    messages=[...],    # Required: conversation
    temperature=0.7,   # Optional: randomness (0-2)
    max_tokens=2048,   # Optional: response limit
    top_p=0.9,         # Optional: nucleus sampling
    stop=["END"],      # Optional: stop sequences
)
```
## Error Handling
### Rate Limits
On 429 responses:
1. Reads `Retry-After` header
2. Falls back to exponential backoff with jitter
3. Tries next backend (if failover)
### Provider Errors
On 5xx responses:
1. Logs error
2. Retries with backoff
3. Tries next backend (if failover)
### All Backends Failed
```python
from xml_pipeline.llm.router import BackendError

try:
    response = await complete(model, messages)
except BackendError as e:
    # All backends failed
    logger.error(f"LLM call failed: {e}")
## Rate Limiting
Each backend has independent limits:
- **Token bucket**: Limits tokens per minute (`rate_limit_tpm`)
- **Semaphore**: Limits concurrent requests (`max_concurrent`)
Requests wait if limits are reached.
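The two mechanisms compose naturally: an `asyncio.Semaphore` gates concurrency while a token bucket meters throughput. A self-contained sketch, assuming a simple continuous-refill bucket (not the router's actual classes):

```python
import asyncio
import time

class TokenBucket:
    """Tokens-per-minute bucket with continuous refill (sketch)."""

    def __init__(self, tpm: int):
        self.capacity = tpm
        self.tokens = float(tpm)
        self.last = time.monotonic()

    async def acquire(self, n: int) -> None:
        while True:
            now = time.monotonic()
            # Refill proportionally to elapsed time, capped at capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.capacity / 60)
            self.last = now
            if self.tokens >= n:
                self.tokens -= n
                return
            # Sleep roughly until enough tokens have refilled.
            await asyncio.sleep((n - self.tokens) * 60 / self.capacity)

async def main():
    bucket = TokenBucket(tpm=100_000)   # rate_limit_tpm
    sem = asyncio.Semaphore(20)         # max_concurrent
    async with sem:                     # concurrency gate
        await bucket.acquire(500)       # reserve estimated tokens
        # ... issue the request here ...

asyncio.run(main())
```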
## Token Tracking
Track usage per agent:
```python
from xml_pipeline.llm.router import get_router
router = get_router()
# Get usage for agent
usage = router.get_agent_usage("greeter")
print(f"Total tokens: {usage.total_tokens}")
print(f"Requests: {usage.request_count}")
# Reset tracking
router.reset_agent_usage("greeter")
```
## Best Practices
### 1. Use System Prompts
```python
response = await complete(
    model="grok-4.1",
    messages=[
        {"role": "system", "content": metadata.usage_instructions},
        {"role": "user", "content": payload.query},
    ],
)
```
### 2. Handle Errors Gracefully
```python
try:
    response = await complete(model, messages)
except BackendError:
    return HandlerResponse(
        payload=ErrorResponse(message="LLM unavailable"),
        to=metadata.from_id,
    )
```
### 3. Set Appropriate Limits
```yaml
llm:
  backends:
    - provider: xai
      rate_limit_tpm: 50000   # Conservative limit
      max_concurrent: 10      # Prevent overload
```
### 4. Use Failover for Reliability
```yaml
llm:
  strategy: failover
  backends:
    - provider: xai
      priority: 1
    - provider: anthropic
      priority: 2   # Backup
```
## See Also
- [[Writing Handlers]] — Using LLM in handlers
- [[Configuration]] — Full LLM configuration
- [[Architecture Overview]] — System architecture