# LLM Router

The LLM Router provides a unified interface for language model calls. Agents request a model by name; the router handles backend selection, failover, rate limiting, and retries.

## Overview

```
┌─────────────────────────────────────────────────────────────────┐
│                          Agent Handler                          │
│   response = await complete("grok-4.1", messages)               │
└─────────────────────────────────┬───────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────┐
│                           LLM Router                            │
│   • Find backends serving model                                 │
│   • Select backend (strategy)                                   │
│   • Retry on failure                                            │
│   • Track usage per agent                                       │
└────────────┬────────────────┬────────────────┬──────────────────┘
             │                │                │
             ▼                ▼                ▼
        ┌──────────┐     ┌──────────┐     ┌──────────┐
        │   XAI    │     │Anthropic │     │  Ollama  │
        │ Backend  │     │ Backend  │     │ Backend  │
        └──────────┘     └──────────┘     └──────────┘
```

## Quick Start

### Basic Usage

```python
from xml_pipeline.platform.llm_api import complete

response = await complete(
    model="grok-4.1",
    messages=[
        {"role": "system", "content": "You are helpful."},
        {"role": "user", "content": "Hello!"},
    ],
)

print(response.content)
```

### In a Handler

```python
async def my_agent(payload: Query, metadata: HandlerMetadata) -> HandlerResponse:
    from xml_pipeline.platform.llm_api import complete

    response = await complete(
        model="grok-4.1",
        messages=[
            {"role": "system", "content": metadata.usage_instructions},
            {"role": "user", "content": payload.question},
        ],
        temperature=0.7,
        max_tokens=2048,
    )

    return HandlerResponse(
        payload=Answer(text=response.content),
        to="output",
    )
```

## Configuration

### organism.yaml

```yaml
llm:
  strategy: failover        # Backend selection strategy
  retries: 3                # Max retry attempts
  retry_base_delay: 1.0     # Base delay for backoff
  retry_max_delay: 60.0     # Max delay between retries

  backends:
    - provider: xai
      api_key_env: XAI_API_KEY
      priority: 1             # Lower = preferred
      rate_limit_tpm: 100000  # Tokens per minute
      max_concurrent: 20      # Concurrent request limit

    - provider: anthropic
      api_key_env: ANTHROPIC_API_KEY
      priority: 2

    - provider: openai
      api_key_env: OPENAI_API_KEY
      priority: 3

    - provider: ollama
      base_url: http://localhost:11434
      supported_models: [llama3, mistral]
```

### Environment Variables

```env
# .env file
XAI_API_KEY=xai-abc123...
ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...
```

## Supported Providers

| Provider | Models | Auth |
|----------|--------|------|
| `xai` | grok-* | Bearer token |
| `anthropic` | claude-* | x-api-key header |
| `openai` | gpt-*, o1-*, o3-* | Bearer token |
| `ollama` | Any local model | None (local) |

### Model Routing

The router automatically selects backends based on model name:

- `grok-4.1` → XAI backend
- `claude-sonnet-4` → Anthropic backend
- `gpt-4o` → OpenAI backend
- `llama3` → Ollama (if in `supported_models`)

## Strategies

### Failover (Default)

Tries backends in priority order. Falls back on error.

```yaml
llm:
  strategy: failover
  backends:
    - provider: xai
      priority: 1       # Try first
    - provider: anthropic
      priority: 2       # Fallback
```
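Conceptually, failover is a loop over backends sorted by priority. A minimal sketch — `backend.call`, `.name`, and `.priority` are illustrative attribute names, not the router's actual interface:

```python
# Simplified failover sketch: try each backend in priority order,
# return the first success, raise only if every backend fails.
async def failover_call(backends, messages):
    errors = []
    for backend in sorted(backends, key=lambda b: b.priority):
        try:
            return await backend.call(messages)   # hypothetical backend API
        except Exception as exc:
            errors.append((backend.name, exc))    # record and fall through
    raise RuntimeError(f"all backends failed: {errors}")
```
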

### Round-Robin

Distributes requests evenly across backends.

```yaml
llm:
  strategy: round-robin
```
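Round-robin selection can be sketched with a cycling counter (illustrative only; it assumes an in-order list of candidate backends):

```python
import itertools

# Round-robin sketch: each call picks the next backend in the list,
# wrapping around when the end is reached.
def make_round_robin(backends):
    counter = itertools.count()
    def pick():
        return backends[next(counter) % len(backends)]
    return pick
```
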

### Least-Loaded

Routes to the backend with the lowest current load.

```yaml
llm:
  strategy: least-loaded
```
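Least-loaded selection amounts to picking the minimum of a per-backend in-flight counter. A sketch — the counter structure is hypothetical:

```python
# Least-loaded sketch: given a map of backend -> in-flight request
# count, route to the backend with the fewest open requests.
def pick_least_loaded(in_flight: dict) -> str:
    return min(in_flight, key=in_flight.get)
```
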

## Response Format

```python
from dataclasses import dataclass
from typing import Any, Dict

@dataclass
class LLMResponse:
    content: str           # Generated text
    model: str             # Model used
    usage: Dict[str, int]  # Token counts
    finish_reason: str     # stop, length, tool_calls
    raw: Any               # Provider-specific response
```

### Usage Dict

```python
response.usage = {
    "prompt_tokens": 150,
    "completion_tokens": 50,
    "total_tokens": 200,
}
```

## Parameters

```python
response = await complete(
    model="grok-4.1",     # Required: model name
    messages=[...],       # Required: conversation
    temperature=0.7,      # Optional: randomness (0-2)
    max_tokens=2048,      # Optional: response limit
    top_p=0.9,            # Optional: nucleus sampling
    stop=["END"],         # Optional: stop sequences
)
```

## Error Handling

### Rate Limits

On 429 responses:

1. Reads `Retry-After` header
2. Falls back to exponential backoff with jitter
3. Tries next backend (if failover)


### Provider Errors

On 5xx responses:

1. Logs error
2. Retries with backoff
3. Tries next backend (if failover)

### All Backends Failed

```python
from xml_pipeline.llm.router import BackendError

try:
    response = await complete(model, messages)
except BackendError as e:
    # All backends failed
    logger.error(f"LLM call failed: {e}")
```

## Rate Limiting

Each backend has independent limits:

- **Token bucket**: Limits tokens per minute (`rate_limit_tpm`)
- **Semaphore**: Limits concurrent requests (`max_concurrent`)

Requests wait if limits are reached.

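A minimal token-bucket sketch, assuming the refill rate is derived from `rate_limit_tpm`. Shown synchronously for illustration; the router's version would await until tokens are available rather than return `False`:

```python
import time

class TokenBucket:
    """Refills `tpm` tokens per minute, up to a burst capacity of `tpm`."""
    def __init__(self, tpm: int):
        self.capacity = float(tpm)
        self.tokens = float(tpm)
        self.rate = tpm / 60.0            # tokens per second
        self.updated = time.monotonic()

    def try_acquire(self, n: int) -> bool:
        # Top up the bucket for the time elapsed since the last call.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False                      # caller waits and retries
```
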
## Token Tracking

Track usage per agent:

```python
from xml_pipeline.llm.router import get_router

router = get_router()

# Get usage for agent
usage = router.get_agent_usage("greeter")
print(f"Total tokens: {usage.total_tokens}")
print(f"Requests: {usage.request_count}")

# Reset tracking
router.reset_agent_usage("greeter")
```

## Best Practices

### 1. Use System Prompts

```python
response = await complete(
    model="grok-4.1",
    messages=[
        {"role": "system", "content": metadata.usage_instructions},
        {"role": "user", "content": payload.query},
    ],
)
```

### 2. Handle Errors Gracefully

```python
from xml_pipeline.llm.router import BackendError

try:
    response = await complete(model, messages)
except BackendError:
    return HandlerResponse(
        payload=ErrorResponse(message="LLM unavailable"),
        to=metadata.from_id,
    )
```

### 3. Set Appropriate Limits

```yaml
llm:
  backends:
    - provider: xai
      rate_limit_tpm: 50000   # Conservative limit
      max_concurrent: 10      # Prevent overload
```

### 4. Use Failover for Reliability

```yaml
llm:
  strategy: failover
  backends:
    - provider: xai
      priority: 1
    - provider: anthropic
      priority: 2   # Backup
```

## See Also

- [[Writing Handlers]] — Using LLM in handlers
- [[Configuration]] — Full LLM configuration
- [[Architecture Overview]] — System architecture