LLM Router
The LLM Router provides a unified interface for language model calls. Agents request a model by name; the router handles backend selection, failover, rate limiting, and retries.
Overview
```
┌──────────────────────────────────────────────────────┐
│                    Agent Handler                      │
│   response = await complete("grok-4.1", messages)    │
└──────────────────────────┬───────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────┐
│                      LLM Router                       │
│  • Find backends serving model                        │
│  • Select backend (strategy)                          │
│  • Retry on failure                                   │
│  • Track usage per agent                              │
└────────┬─────────────────┬─────────────────┬─────────┘
         │                 │                 │
         ▼                 ▼                 ▼
    ┌──────────┐      ┌──────────┐      ┌──────────┐
    │   XAI    │      │Anthropic │      │  Ollama  │
    │ Backend  │      │ Backend  │      │ Backend  │
    └──────────┘      └──────────┘      └──────────┘
```
Quick Start
Basic Usage
```python
from xml_pipeline.platform.llm_api import complete

response = await complete(
    model="grok-4.1",
    messages=[
        {"role": "system", "content": "You are helpful."},
        {"role": "user", "content": "Hello!"},
    ],
)
print(response.content)
```
In a Handler
```python
async def my_agent(payload: Query, metadata: HandlerMetadata) -> HandlerResponse:
    from xml_pipeline.platform.llm_api import complete

    response = await complete(
        model="grok-4.1",
        messages=[
            {"role": "system", "content": metadata.usage_instructions},
            {"role": "user", "content": payload.question},
        ],
        temperature=0.7,
        max_tokens=2048,
    )
    return HandlerResponse(
        payload=Answer(text=response.content),
        to="output",
    )
```
Configuration
organism.yaml
```yaml
llm:
  strategy: failover          # Backend selection strategy
  retries: 3                  # Max retry attempts
  retry_base_delay: 1.0       # Base delay for backoff
  retry_max_delay: 60.0       # Max delay between retries
  backends:
    - provider: xai
      api_key_env: XAI_API_KEY
      priority: 1             # Lower = preferred
      rate_limit_tpm: 100000  # Tokens per minute
      max_concurrent: 20      # Concurrent request limit
    - provider: anthropic
      api_key_env: ANTHROPIC_API_KEY
      priority: 2
    - provider: openai
      api_key_env: OPENAI_API_KEY
      priority: 3
    - provider: ollama
      base_url: http://localhost:11434
      supported_models: [llama3, mistral]
```
Environment Variables
```bash
# .env file
XAI_API_KEY=xai-abc123...
ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...
```
Supported Providers
| Provider | Models | Auth |
|---|---|---|
| xai | `grok-*` | Bearer token |
| anthropic | `claude-*` | `x-api-key` header |
| openai | `gpt-*`, `o1-*`, `o3-*` | Bearer token |
| ollama | Any local model | None (local) |
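As a rough illustration of the Auth column above, the sketch below shows how each provider's credentials might be turned into request headers. The helper name and structure are assumptions for illustration; the router's actual backend classes are not documented on this page. The env var names match the configuration section above.

```python
import os

# Hypothetical helper mapping the "Auth" column to HTTP headers (illustrative only).
def auth_headers(provider: str) -> dict:
    if provider in ("xai", "openai"):
        key = os.environ["XAI_API_KEY" if provider == "xai" else "OPENAI_API_KEY"]
        return {"Authorization": f"Bearer {key}"}               # Bearer token
    if provider == "anthropic":
        return {"x-api-key": os.environ["ANTHROPIC_API_KEY"]}   # x-api-key header
    if provider == "ollama":
        return {}                                               # Local server, no auth
    raise ValueError(f"Unknown provider: {provider}")
```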
Model Routing
The router automatically selects backends based on model name:
- `grok-4.1` → XAI backend
- `claude-sonnet-4` → Anthropic backend
- `gpt-4o` → OpenAI backend
- `llama3` → Ollama backend (if listed in `supported_models`)
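A minimal sketch of the name-based matching implied by this list is shown below. The prefix table and helper function are illustrative assumptions, not the router's actual code.

```python
# Illustrative prefix routing; the real router's matching logic may differ.
MODEL_PREFIXES = {
    "xai": ("grok-",),
    "anthropic": ("claude-",),
    "openai": ("gpt-", "o1-", "o3-"),
}

def providers_for(model: str, ollama_models: set) -> list:
    matches = [
        provider
        for provider, prefixes in MODEL_PREFIXES.items()
        if model.startswith(prefixes)
    ]
    if model in ollama_models:   # from supported_models in organism.yaml
        matches.append("ollama")
    return matches

# providers_for("grok-4.1", {"llama3", "mistral"})  -> ["xai"]
# providers_for("llama3", {"llama3", "mistral"})    -> ["ollama"]
```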
Strategies
Failover (Default)
Tries backends in priority order. Falls back on error.
```yaml
llm:
  strategy: failover
  backends:
    - provider: xai
      priority: 1        # Try first
    - provider: anthropic
      priority: 2        # Fallback
```
Round-Robin
Distributes requests evenly across backends.
```yaml
llm:
  strategy: round-robin
```
Least-Loaded
Routes each request to the backend with the lowest current load.
```yaml
llm:
  strategy: least-loaded
```
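To make the three strategies concrete, here is a hedged sketch of how each one might choose among the backends that serve a model. `priority` comes from `organism.yaml`; the round-robin counter and the `current_load` attribute are assumed runtime state, and the selection code is illustrative rather than the router's actual implementation.

```python
import itertools

_round_robin = itertools.count()

def select_backend(strategy: str, backends: list):
    if strategy == "failover":
        # Lowest priority number first; the caller moves to the next entry on error.
        return sorted(backends, key=lambda b: b.priority)[0]
    if strategy == "round-robin":
        return backends[next(_round_robin) % len(backends)]
    if strategy == "least-loaded":
        return min(backends, key=lambda b: b.current_load)
    raise ValueError(f"Unknown strategy: {strategy}")
```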
Response Format
```python
from dataclasses import dataclass
from typing import Any, Dict

@dataclass
class LLMResponse:
    content: str            # Generated text
    model: str              # Model used
    usage: Dict[str, int]   # Token counts
    finish_reason: str      # stop, length, tool_calls
    raw: Any                # Provider-specific response
```
Usage Dict
```python
response.usage = {
    "prompt_tokens": 150,
    "completion_tokens": 50,
    "total_tokens": 200,
}
```
Parameters
```python
response = await complete(
    model="grok-4.1",       # Required: model name
    messages=[...],         # Required: conversation
    temperature=0.7,        # Optional: randomness (0-2)
    max_tokens=2048,        # Optional: response limit
    top_p=0.9,              # Optional: nucleus sampling
    stop=["END"],           # Optional: stop sequences
)
```
Error Handling
Rate Limits
On 429 responses, the router:
- Reads the `Retry-After` header
- Falls back to exponential backoff with jitter
- Tries the next backend (if failover)
Provider Errors
On 5xx responses, the router:
- Logs the error
- Retries with backoff
- Tries the next backend (if failover)
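The retry schedule above can be pictured with the `retries`, `retry_base_delay`, and `retry_max_delay` keys from `organism.yaml`. The sketch below is an assumption about how the delay might be computed, not the router's exact code.

```python
import random

# Illustrative delay computation; the real jitter and capping behaviour may differ.
def retry_delay(attempt: int, retry_after: float | None = None,
                base: float = 1.0, cap: float = 60.0) -> float:
    if retry_after is not None:                         # 429 with a Retry-After header
        return retry_after
    delay = base * (2 ** attempt)                       # exponential backoff
    return min(cap, delay * random.uniform(0.5, 1.5))  # jitter, capped at retry_max_delay
```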
All Backends Failed
```python
import logging

from xml_pipeline.llm.router import BackendError

logger = logging.getLogger(__name__)

try:
    response = await complete(model, messages)
except BackendError as e:
    # All backends failed
    logger.error(f"LLM call failed: {e}")
```
Rate Limiting
Each backend has independent limits:
- Token bucket: limits tokens per minute (`rate_limit_tpm`)
- Semaphore: limits concurrent requests (`max_concurrent`)
Requests wait if limits are reached.
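Here is a rough sketch of the token-bucket half of this behaviour; the `max_concurrent` half is simply an `asyncio.Semaphore` held for the duration of each request. The class below is illustrative only and not the router's actual implementation.

```python
import asyncio
import time

# Illustrative token bucket for rate_limit_tpm; not the router's real limiter.
class TokenBucket:
    def __init__(self, rate_limit_tpm: int):
        self.capacity = float(rate_limit_tpm)
        self.tokens = float(rate_limit_tpm)
        self.refill_per_sec = rate_limit_tpm / 60.0
        self.updated = time.monotonic()

    async def acquire(self, estimated_tokens: int) -> None:
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.updated) * self.refill_per_sec)
            self.updated = now
            if self.tokens >= estimated_tokens:
                self.tokens -= estimated_tokens
                return
            # Sleep roughly until enough tokens have refilled.
            await asyncio.sleep((estimated_tokens - self.tokens) / self.refill_per_sec)
```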
Token Tracking
Track usage per agent:
```python
from xml_pipeline.llm.router import get_router

router = get_router()

# Get usage for agent
usage = router.get_agent_usage("greeter")
print(f"Total tokens: {usage.total_tokens}")
print(f"Requests: {usage.request_count}")

# Reset tracking
router.reset_agent_usage("greeter")
```
Best Practices
1. Use System Prompts
```python
response = await complete(
    model="grok-4.1",
    messages=[
        {"role": "system", "content": metadata.usage_instructions},
        {"role": "user", "content": payload.query},
    ],
)
```
2. Handle Errors Gracefully
```python
try:
    response = await complete(model, messages)
except BackendError:
    return HandlerResponse(
        payload=ErrorResponse(message="LLM unavailable"),
        to=metadata.from_id,
    )
```
3. Set Appropriate Limits
```yaml
llm:
  backends:
    - provider: xai
      rate_limit_tpm: 50000   # Conservative limit
      max_concurrent: 10      # Prevent overload
```
4. Use Failover for Reliability
```yaml
llm:
  strategy: failover
  backends:
    - provider: xai
      priority: 1
    - provider: anthropic
      priority: 2   # Backup
```
See Also
- Writing Handlers — Using LLM in handlers
- Configuration — Full LLM configuration
- Architecture Overview — System architecture