
BloxServer LLM Abstraction Layer — Resilient Multi-Provider Architecture

Status: Design
Date: January 2026

Overview

The LLM abstraction layer is the critical path for all AI operations in BloxServer. It must handle:

  • Viral growth: 100 → 10,000 users overnight
  • Provider outages: Single provider down ≠ platform down
  • Fair access: Paid users prioritized, free users served fairly
  • Cost control: Platform keys vs BYOK (Bring Your Own Key)
  • Low latency: Sub-second for simple calls, reasonable for complex

This document specifies the defense-in-depth architecture that survives success.

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                     LLM Abstraction Layer                        │
│                                                                  │
│  Request → [Rate Limit] → [Cache Check] → [Queue] → [Dispatch]  │
│                │               │              │          │       │
│                ▼               ▼              ▼          ▼       │
│           Per-user        Semantic       Priority   Provider    │
│           per-tier        cache          queues     pool +      │
│           limits          (30%+ hits)    (by tier)  failover    │
│                                                                  │
│  ┌─────────────────────────────────────────────────────────────┐│
│  │ BYOK (Bring Your Own Key)                                   ││
│  │ Pro+ users with own API keys bypass platform limits         ││
│  └─────────────────────────────────────────────────────────────┘│
│                                                                  │
│  ┌─────────────────────────────────────────────────────────────┐│
│  │ High Frequency Tier                                         ││
│  │ Dedicated capacity, custom SLA — contact sales              ││
│  └─────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────┘

Tier Limits

Tier             Price    Requests/min  Tokens/min  Concurrent  Latency SLA
Free             $0       10            10,000      2           Best effort
Pro              $29/mo   60            100,000     10          < 30s P95
Enterprise       Custom   300           500,000     50          < 10s P95
High Frequency   Custom   Custom        Custom      Dedicated   Custom SLA
BYOK (any tier)  n/a      Unlimited*    Unlimited*  20          User's provider

*BYOK users are limited only by their own provider's rate limits.

High Frequency Tier

For users requiring:

  • Low latency: Sub-second response times
  • High throughput: Thousands of requests per minute
  • Guaranteed capacity: Dedicated provider allocations
  • Custom models: Fine-tuned or private deployments

Use cases:

  • Real-time trading signals
  • Live customer support at scale
  • High-volume content generation
  • Latency-sensitive applications

Pricing: Custom — based on capacity reservation, SLA requirements, and volume.

Landing page CTA:

┌─────────────────────────────────────────────────────────────┐
│                                                              │
│  Need High Frequency?                                        │
│                                                              │
│  Building something that needs thousands of requests per     │
│  minute with sub-second latency? Let's talk dedicated        │
│  capacity and custom SLAs.                                   │
│                                                              │
│  [Contact Sales →]                                           │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Layer 1: Intake Rate Limiting

First line of defense. Rejects requests before they consume resources.

Implementation

from dataclasses import dataclass
from enum import Enum
import time

class Tier(Enum):
    FREE = "free"
    PRO = "pro"
    ENTERPRISE = "enterprise"
    HIGH_FREQUENCY = "high_frequency"

@dataclass
class TierLimits:
    requests_per_minute: int
    tokens_per_minute: int
    max_concurrent: int

TIER_LIMITS = {
    Tier.FREE: TierLimits(10, 10_000, 2),
    Tier.PRO: TierLimits(60, 100_000, 10),
    Tier.ENTERPRISE: TierLimits(300, 500_000, 50),
    Tier.HIGH_FREQUENCY: TierLimits(10_000, 10_000_000, 500),  # Custom per customer
}

@dataclass
class RateLimitResult:
    allowed: bool
    use_user_key: bool = False
    retry_after: int | None = None
    reason: str | None = None
    concurrent_key: str | None = None

async def rate_limit_check(user: User, request: LLMRequest) -> RateLimitResult:
    """Check if user can make this request."""

    # BYOK users bypass platform limits
    if user.has_own_api_key(request.provider):
        return RateLimitResult(allowed=True, use_user_key=True)

    limits = TIER_LIMITS[user.tier]

    # Check requests per minute (sliding window)
    rpm_key = f"ratelimit:{user.id}:rpm"
    now = time.time()
    window_start = now - 60

    # Remove old entries, add new one, count
    pipe = redis.pipeline()
    pipe.zremrangebyscore(rpm_key, 0, window_start)
    pipe.zadd(rpm_key, {str(now): now})
    pipe.zcard(rpm_key)
    pipe.expire(rpm_key, 120)
    _, _, current_rpm, _ = await pipe.execute()

    if current_rpm > limits.requests_per_minute:
        # Retry once the oldest request in the window ages out
        oldest = await redis.zrange(rpm_key, 0, 0, withscores=True)
        retry_after = int(oldest[0][1] + 60 - now) + 1 if oldest else 60
        return RateLimitResult(
            allowed=False,
            retry_after=retry_after,
            reason=f"Rate limit: {limits.requests_per_minute} requests/minute"
        )

    # Check concurrent requests
    concurrent_key = f"ratelimit:{user.id}:concurrent"
    current_concurrent = await redis.incr(concurrent_key)
    await redis.expire(concurrent_key, 300)  # 5 min TTL as safety

    if current_concurrent > limits.max_concurrent:
        await redis.decr(concurrent_key)
        return RateLimitResult(
            allowed=False,
            retry_after=1,
            reason=f"Max concurrent: {limits.max_concurrent} requests"
        )

    return RateLimitResult(allowed=True, concurrent_key=concurrent_key)

async def release_concurrent(concurrent_key: str):
    """Release concurrent slot after request completes."""
    if concurrent_key:
        await redis.decr(concurrent_key)

Rate Limit Headers

Return standard headers so clients can self-regulate:

async def rate_limit_headers(user: User) -> dict:
    limits = TIER_LIMITS[user.tier]
    current = await get_current_usage(user.id)

    return {
        "X-RateLimit-Limit": str(limits.requests_per_minute),
        "X-RateLimit-Remaining": str(max(0, limits.requests_per_minute - current.rpm)),
        "X-RateLimit-Reset": str(int(time.time()) + 60),
    }
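
A hedged sketch of how the 429 path might surface these headers in the FastAPI app; the handler and app objects are illustrative, and it assumes RateLimitExceeded carries message and retry_after, as the main entry point below implies:

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()

@app.exception_handler(RateLimitExceeded)
async def handle_rate_limited(request: Request, exc: RateLimitExceeded):
    # 429 plus Retry-After lets well-behaved clients back off on their own
    return JSONResponse(
        status_code=429,
        content={"error": exc.message},
        headers={"Retry-After": str(exc.retry_after)},
    )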

Layer 2: Semantic Cache

Identical requests return cached responses. Reduces load and cost.

Cache Key Generation

import hashlib
import json

def hash_request(request: LLMRequest) -> str:
    """Generate deterministic cache key for request."""

    # Include all parameters that affect output
    cache_input = {
        "model": request.model,
        "messages": [
            {"role": m.role, "content": m.content}
            for m in request.messages
        ],
        "temperature": request.temperature,
        "max_tokens": request.max_tokens,
        "tools": request.tools,  # Tool definitions matter
        # Exclude: user_id, timestamps, request_id
    }

    serialized = json.dumps(cache_input, sort_keys=True)
    return hashlib.sha256(serialized.encode()).hexdigest()[:32]

Cache Logic

@dataclass
class CachedResponse:
    response: LLMResponse
    cached_at: float
    hit_count: int

async def check_semantic_cache(request: LLMRequest) -> LLMResponse | None:
    """Check if we've seen this exact request before."""

    cache_key = f"llmcache:{hash_request(request)}"
    cached = await redis.get(cache_key)

    if cached:
        data = json.loads(cached)

        # Update hit count for analytics
        await redis.hincrby("llmcache:stats", "hits", 1)

        return LLMResponse(
            content=data["content"],
            model=data["model"],
            usage=data["usage"],
            cached=True,
        )

    await redis.hincrby("llmcache:stats", "misses", 1)
    return None

async def cache_response(request: LLMRequest, response: LLMResponse):
    """Cache response with TTL based on determinism."""

    # Don't cache errors or empty responses
    if response.error or not response.content:
        return

    cache_key = f"llmcache:{hash_request(request)}"

    # TTL based on temperature (determinism)
    if request.temperature == 0:
        ttl = 86400  # 24 hours for deterministic
    elif request.temperature < 0.3:
        ttl = 3600   # 1 hour
    elif request.temperature < 0.7:
        ttl = 300    # 5 minutes
    else:
        return       # Don't cache high-temperature responses

    cache_data = {
        "content": response.content,
        "model": response.model,
        "usage": response.usage,
        "cached_at": time.time(),
    }

    await redis.setex(cache_key, ttl, json.dumps(cache_data))
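
The hits/misses counters above make the cache hit rate from the monitoring section cheap to compute. A minimal helper, assuming the Redis client decodes responses to strings as the rest of this document does:

async def get_cache_hit_rate() -> float:
    """Aggregate hit rate from the llmcache:stats counters."""
    stats = await redis.hgetall("llmcache:stats")
    hits = int(stats.get("hits", 0))
    misses = int(stats.get("misses", 0))
    total = hits + misses
    return hits / total if total else 0.0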

Expected Cache Performance

Use Case                  Temperature  Expected Hit Rate
Tool calls (same inputs)  0            70-90%
Structured extraction     0-0.3        50-70%
Agent reasoning           0.5-0.7      20-40%
Creative content          0.8-1.0      ~0%

Aggregate impact: 30-40% reduction in API calls for typical workloads.

Layer 3: Priority Queues

Paid users get priority. Free users are served fairly but can be shed under load.

Queue Structure

# Redis sorted set with composite score
# Score = (priority * 1e10) + Unix timestamp
# Lower score = higher priority + earlier arrival

import uuid
from dataclasses import asdict

QUEUE_PRIORITIES = {
    Tier.HIGH_FREQUENCY: 0,  # Highest priority (dedicated customers)
    Tier.ENTERPRISE: 1,
    Tier.PRO: 2,
    "trial": 2,              # Trials get Pro priority (first impression)
    Tier.FREE: 3,            # Lowest priority
}

@dataclass
class QueuedRequest:
    ticket_id: str
    user_id: str
    tier: str
    request: LLMRequest
    enqueued_at: float
    use_user_key: bool = False

async def enqueue_request(user: User, request: LLMRequest, use_user_key: bool) -> str:
    """Add request to priority queue, return ticket ID."""

    ticket_id = f"ticket:{uuid.uuid4().hex}"
    priority = QUEUE_PRIORITIES.get(user.tier, 3)

    # Composite score: priority scaled past Unix time (~1.7e9) + arrival time
    score = priority * 10_000_000_000 + time.time()

    queued = QueuedRequest(
        ticket_id=ticket_id,
        user_id=str(user.id),
        tier=user.tier.value if isinstance(user.tier, Tier) else user.tier,  # keep JSON-serializable
        request=request,
        enqueued_at=time.time(),
        use_user_key=use_user_key,
    )

    await redis.zadd("llm:queue", {json.dumps(asdict(queued)): score})

    # Set a result placeholder
    await redis.setex(f"llm:result:{ticket_id}", 300, "pending")

    return ticket_id

Queue Workers

async def queue_worker():
    """Process requests from the queue."""

    while True:
        # Get highest priority item (lowest score)
        items = await redis.zpopmin("llm:queue", count=1)

        if not items:
            await asyncio.sleep(0.1)  # Brief pause if queue empty
            continue

        data, score = items[0]
        queued = QueuedRequest(**json.loads(data))  # request round-trips as a plain dict

        try:
            # Select provider and execute
            response = await execute_llm_request(queued)

            # Store result
            await redis.setex(
                f"llm:result:{queued.ticket_id}",
                300,
                json.dumps({"status": "success", "response": asdict(response)})
            )

        except Exception as e:
            await redis.setex(
                f"llm:result:{queued.ticket_id}",
                300,
                json.dumps({"status": "error", "error": str(e)})
            )

async def wait_for_result(ticket_id: str, timeout: float = 120) -> LLMResponse:
    """Wait for queued request to complete."""

    deadline = time.time() + timeout

    while time.time() < deadline:
        result = await redis.get(f"llm:result:{ticket_id}")

        if result and result != "pending":
            data = json.loads(result)
            if data["status"] == "success":
                return LLMResponse(**data["response"])
            else:
                raise LLMError(data["error"])

        await asyncio.sleep(0.1)

    raise RequestTimeout("Request timed out")

Queue Health Monitoring

@dataclass
class QueueHealth:
    size: int
    oldest_wait_seconds: float
    by_tier: dict[str, int]
    status: str  # healthy, degraded, critical

async def get_queue_health() -> QueueHealth:
    """Get queue metrics for monitoring and load shedding."""

    queue_size = await redis.zcard("llm:queue")

    # Get oldest item
    oldest = await redis.zrange("llm:queue", 0, 0, withscores=True)
    if oldest:
        oldest_score = oldest[0][1]
        oldest_time = oldest_score % 10_000_000_000  # strip the priority component
        wait_time = time.time() - oldest_time
    else:
        wait_time = 0

    # Count by tier
    all_items = await redis.zrange("llm:queue", 0, -1)
    by_tier = {}
    for item in all_items:
        data = json.loads(item)
        tier = data.get("tier", "unknown")
        by_tier[tier] = by_tier.get(tier, 0) + 1

    # Determine status
    if queue_size < 500:
        status = "healthy"
    elif queue_size < 2000:
        status = "degraded"
    else:
        status = "critical"

    return QueueHealth(
        size=queue_size,
        oldest_wait_seconds=wait_time,
        by_tier=by_tier,
        status=status,
    )
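
Queue health pairs naturally with the existing health/readiness endpoints. A hedged sketch, reusing the app object from the Rate Limit Headers sketch above; the route path is illustrative:

from dataclasses import asdict

@app.get("/health/queue")
async def queue_health_endpoint():
    # Queue depth and tier mix for dashboards and the load shedder
    return asdict(await get_queue_health())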

Layer 4: Multi-Provider Pool with Circuit Breakers

Never depend on a single provider.

Provider Configuration

@dataclass
class ProviderConfig:
    name: str
    base_url: str
    api_key_env: str
    models: list[str]
    max_concurrent: int
    priority: int  # Lower = preferred
    timeout: float = 60.0

PROVIDERS = {
    "anthropic": ProviderConfig(
        name="anthropic",
        base_url="https://api.anthropic.com/v1",
        api_key_env="ANTHROPIC_API_KEY",
        models=["claude-sonnet-4-20250514", "claude-opus-4-20250514", "claude-haiku-3"],
        max_concurrent=100,
        priority=1,
    ),
    "openai": ProviderConfig(
        name="openai",
        base_url="https://api.openai.com/v1",
        api_key_env="OPENAI_API_KEY",
        models=["gpt-4o", "gpt-4o-mini", "o1", "o3-mini"],
        max_concurrent=50,
        priority=2,
    ),
    "xai": ProviderConfig(
        name="xai",
        base_url="https://api.x.ai/v1",
        api_key_env="XAI_API_KEY",
        models=["grok-3", "grok-3-mini"],
        max_concurrent=50,
        priority=1,
    ),
    "together": ProviderConfig(
        name="together",
        base_url="https://api.together.xyz/v1",
        api_key_env="TOGETHER_API_KEY",
        models=["llama-3-70b", "mixtral-8x7b"],
        max_concurrent=100,
        priority=3,  # Fallback
    ),
}

Circuit Breaker State

@dataclass
class CircuitState:
    provider: str
    healthy: bool = True
    failures: int = 0
    successes: int = 0
    last_failure: float = 0
    circuit_open_until: float = 0
    current_load: int = 0

# In-memory state (could be Redis for distributed)
CIRCUIT_STATES: dict[str, CircuitState] = {
    name: CircuitState(provider=name)
    for name in PROVIDERS
}

CIRCUIT_CONFIG = {
    "failure_threshold": 5,      # Failures before opening
    "success_threshold": 3,      # Successes before closing
    "open_duration": 30,         # Seconds circuit stays open
    "half_open_requests": 1,     # Requests allowed in half-open state
}

async def record_success(provider: str):
    """Record successful request."""
    state = CIRCUIT_STATES[provider]
    state.successes += 1
    state.failures = 0

    if not state.healthy and state.successes >= CIRCUIT_CONFIG["success_threshold"]:
        state.healthy = True
        logger.info(f"Circuit closed for {provider}")

async def record_failure(provider: str, error: Exception):
    """Record failed request, potentially open circuit."""
    state = CIRCUIT_STATES[provider]
    state.failures += 1
    state.successes = 0
    state.last_failure = time.time()

    if state.failures >= CIRCUIT_CONFIG["failure_threshold"]:
        state.healthy = False
        state.circuit_open_until = time.time() + CIRCUIT_CONFIG["open_duration"]
        logger.error(f"Circuit opened for {provider}: {error}")
        await alert_ops(f"LLM provider {provider} circuit opened")

def is_provider_available(provider: str) -> bool:
    """Check if provider can accept requests."""
    state = CIRCUIT_STATES[provider]
    config = PROVIDERS[provider]

    # Circuit open?
    if not state.healthy:
        if time.time() < state.circuit_open_until:
            return False
        # Half-open: allow limited requests to probe

    # At capacity?
    if state.current_load >= config.max_concurrent:
        return False

    return True

Provider Selection

def get_providers_for_model(model: str) -> list[str]:
    """Get providers that support this model."""
    return [
        name for name, config in PROVIDERS.items()
        if model in config.models or any(model.startswith(m.split("-")[0]) for m in config.models)
    ]

async def select_provider(request: LLMRequest, user_key: str | None = None) -> tuple[str, str]:
    """Select best available provider, return (provider_name, api_key)."""

    candidates = get_providers_for_model(request.model)

    if not candidates:
        raise UnsupportedModel(f"No provider supports model: {request.model}")

    # Filter to available providers
    available = [p for p in candidates if is_provider_available(p)]

    if not available:
        raise NoProvidersAvailable(
            "All providers for this model are currently unavailable. "
            "Please try again in a few seconds."
        )

    # Sort by priority, then by current load
    available.sort(key=lambda p: (
        PROVIDERS[p].priority,
        CIRCUIT_STATES[p].current_load / PROVIDERS[p].max_concurrent
    ))

    selected = available[0]

    # Determine API key
    if user_key:
        api_key = user_key
    else:
        api_key = os.environ[PROVIDERS[selected].api_key_env]

    return selected, api_key
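
The queue worker above calls execute_llm_request, which this document never defines; it is where provider selection, load accounting, and circuit-breaker bookkeeping meet. A minimal sketch under that assumption, built only from helpers defined here (select_provider, the circuit-state functions, and execute_with_timeout from Layer 6 below):

async def execute_llm_request(queued: QueuedRequest) -> LLMResponse:
    """Dispatch a queued request: pick a provider, track load, update circuit state."""

    # The request round-trips through JSON as a plain dict; rehydrate it
    request = queued.request if isinstance(queued.request, LLMRequest) else LLMRequest(**queued.request)

    user_key = None
    if queued.use_user_key:
        user_key = await get_user_api_key(queued.user_id, request.provider)

    provider, api_key = await select_provider(request, user_key)
    state = CIRCUIT_STATES[provider]
    state.current_load += 1

    try:
        response = await execute_with_timeout(request, provider, api_key)
        await record_success(provider)
        return response
    except RequestTimeout:
        raise  # execute_with_timeout already recorded the failure
    except Exception as e:
        await record_failure(provider, e)
        raise
    finally:
        state.current_load -= 1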

Layer 5: BYOK (Bring Your Own Key)

Pro+ users can add their own API keys to bypass platform limits.

Database Schema

CREATE TABLE user_api_keys (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id UUID REFERENCES users(id) ON DELETE CASCADE,
    provider VARCHAR(50) NOT NULL,
    encrypted_key BYTEA NOT NULL,
    key_hint VARCHAR(20),  -- Last 6 chars for display: "...abc123"
    is_valid BOOLEAN DEFAULT true,
    last_used_at TIMESTAMPTZ,
    last_error VARCHAR(255),
    created_at TIMESTAMPTZ DEFAULT NOW(),

    UNIQUE(user_id, provider)
);

CREATE INDEX idx_user_api_keys_user ON user_api_keys(user_id);

Key Encryption

from cryptography.fernet import Fernet

# Platform encryption key (from environment, rotated periodically)
ENCRYPTION_KEY = Fernet(os.environ["API_KEY_ENCRYPTION_KEY"])

def encrypt_api_key(key: str) -> bytes:
    """Encrypt user's API key for storage."""
    return ENCRYPTION_KEY.encrypt(key.encode())

def decrypt_api_key(encrypted: bytes) -> str:
    """Decrypt user's API key for use."""
    return ENCRYPTION_KEY.decrypt(encrypted).decode()
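
The comment above assumes periodic key rotation; cryptography's MultiFernet supports this pattern directly (encrypt with the newest key, decrypt with any, re-encrypt lazily). A sketch, where API_KEY_ENCRYPTION_KEYS_OLD is an assumption rather than part of the existing config:

from cryptography.fernet import MultiFernet

def build_fernet() -> MultiFernet:
    """Newest key first: MultiFernet encrypts with keys[0], decrypts with any."""
    keys = [os.environ["API_KEY_ENCRYPTION_KEY"]]
    keys += [k for k in os.environ.get("API_KEY_ENCRYPTION_KEYS_OLD", "").split(",") if k]
    return MultiFernet([Fernet(k) for k in keys])

def rotate_encrypted_key(fernet: MultiFernet, encrypted: bytes) -> bytes:
    # Re-encrypts under the newest key; store the result back in user_api_keys
    return fernet.rotate(encrypted)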

async def store_user_api_key(user_id: str, provider: str, api_key: str):
    """Store encrypted API key for user."""

    # Validate key format
    if not validate_key_format(provider, api_key):
        raise InvalidAPIKey(f"Invalid {provider} API key format")

    # Test the key
    if not await test_api_key(provider, api_key):
        raise InvalidAPIKey(f"API key validation failed for {provider}")

    encrypted = encrypt_api_key(api_key)
    key_hint = f"...{api_key[-6:]}"

    await db.execute("""
        INSERT INTO user_api_keys (user_id, provider, encrypted_key, key_hint)
        VALUES ($1, $2, $3, $4)
        ON CONFLICT (user_id, provider)
        DO UPDATE SET encrypted_key = $3, key_hint = $4, is_valid = true, last_error = NULL
    """, user_id, provider, encrypted, key_hint)

async def get_user_api_key(user_id: str, provider: str) -> str | None:
    """Get decrypted API key for user, if they have one."""

    row = await db.fetchrow("""
        SELECT encrypted_key, is_valid
        FROM user_api_keys
        WHERE user_id = $1 AND provider = $2
    """, user_id, provider)

    if not row or not row["is_valid"]:
        return None

    return decrypt_api_key(row["encrypted_key"])

BYOK Request Flow

async def execute_with_byok(user: User, request: LLMRequest) -> LLMResponse:
    """Execute request, preferring user's own key if available."""

    # Check for user's key on the provider this request targets
    user_key = await get_user_api_key(user.id, request.provider)

    if user_key:
        # Use user's key - bypass platform rate limits
        try:
            response = await call_provider_direct(request, user_key)

            # Update last used
            await db.execute("""
                UPDATE user_api_keys
                SET last_used_at = NOW(), last_error = NULL
                WHERE user_id = $1 AND provider = $2
            """, user.id, request.provider)

            return response

        except AuthenticationError:
            # Key is invalid - mark it and fall back to platform
            await db.execute("""
                UPDATE user_api_keys
                SET is_valid = false, last_error = 'Authentication failed'
                WHERE user_id = $1 AND provider = $2
            """, user.id, request.provider)

            # Notify user
            await send_notification(user, "api_key_invalid", {
                "provider": request.provider
            })

            # Fall through to platform key

    # Use platform key (with rate limiting)
    return await execute_with_platform_key(user, request)

Layer 6: Backpressure & Graceful Degradation

When overwhelmed, fail gracefully and prioritize paid users.

Load Shedding

import random

async def should_shed_load(user: User, queue_health: QueueHealth) -> bool:
    """Determine if this request should be rejected to protect the system."""

    # High Frequency and Enterprise never shed
    if user.tier in [Tier.HIGH_FREQUENCY, Tier.ENTERPRISE]:
        return False

    # Pro sheds only when the queue is critical
    if user.tier == Tier.PRO:
        return queue_health.status == "critical"

    # Free tier sheds in degraded or critical
    if user.tier == Tier.FREE and queue_health.status in ["degraded", "critical"]:
        # Probabilistic shedding that ramps with queue size
        shed_probability = min(0.9, (queue_health.size - 500) / 2000)
        return random.random() < shed_probability

    return False

Graceful Error Messages

class ServiceDegraded(Exception):
    """Raised when load shedding rejects a request."""

    def __init__(self, tier: Tier, queue_health: QueueHealth):
        if tier == Tier.FREE:
            message = (
                "We're experiencing high demand. Free tier requests are "
                "temporarily paused. Upgrade to Pro for priority access, "
                "or try again in a few minutes."
            )
            retry_after = 60
        else:
            message = (
                "High demand is causing delays. Your request has been queued. "
                "Expected wait time: ~{} seconds."
            ).format(int(queue_health.oldest_wait_seconds * 1.5))
            retry_after = 30

        self.message = message
        self.retry_after = retry_after
        super().__init__(message)

Timeout Handling

async def execute_with_timeout(request: LLMRequest, provider: str, api_key: str) -> LLMResponse:
    """Execute request with appropriate timeout."""

    # Timeout based on expected response size
    if request.max_tokens and request.max_tokens > 2000:
        timeout = 120  # Long responses need more time
    else:
        timeout = 60

    try:
        async with asyncio.timeout(timeout):
            return await call_provider(request, provider, api_key)
    except asyncio.TimeoutError:
        await record_failure(provider, TimeoutError("Request timed out"))
        raise RequestTimeout(
            f"Request timed out after {timeout}s. "
            "Try reducing max_tokens or simplifying the prompt."
        )

Main Entry Point

async def handle_llm_request(user: User, request: LLMRequest) -> LLMResponse:
    """
    Main entry point for all LLM requests.
    Implements full defense-in-depth stack.
    """

    concurrent_key = None

    try:
        # Layer 1: Rate limiting
        rate_result = await rate_limit_check(user, request)
        if not rate_result.allowed:
            raise RateLimitExceeded(
                message=rate_result.reason,
                retry_after=rate_result.retry_after
            )
        concurrent_key = rate_result.concurrent_key

        # Layer 2: Semantic cache
        cached = await check_semantic_cache(request)
        if cached:
            return cached

        # Layer 6: Check queue health for load shedding
        queue_health = await get_queue_health()
        if await should_shed_load(user, queue_health):
            raise ServiceDegraded(user.tier, queue_health)

        # Layer 3: Enqueue with priority (workers cover Layers 4-5)
        ticket_id = await enqueue_request(user, request, rate_result.use_user_key)

        # Wait for a queue worker to complete the request
        response = await wait_for_result(ticket_id, timeout=120)

        # Feed Layer 2: cache the successful response
        await cache_response(request, response)

        return response

    finally:
        # Always release concurrent slot
        if concurrent_key:
            await release_concurrent(concurrent_key)
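
A hedged sketch of the HTTP surface for this entry point; the route path, the get_current_user dependency, and the status-code mapping are illustrative rather than part of the existing scaffold (RateLimitExceeded is covered app-wide by the 429 handler sketched under Rate Limit Headers):

from fastapi import Depends, HTTPException

@app.post("/v1/llm/complete")
async def complete(request: LLMRequest, user: User = Depends(get_current_user)):
    try:
        return await handle_llm_request(user, request)
    except ServiceDegraded as e:
        raise HTTPException(503, detail=e.message, headers={"Retry-After": str(e.retry_after)})
    except NoProvidersAvailable as e:
        raise HTTPException(503, detail=str(e))
    except RequestTimeout as e:
        raise HTTPException(504, detail=str(e))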

Monitoring & Alerts

Key Metrics

Metric                    Source          Warning  Critical
Queue depth               Redis ZCARD     > 500    > 2000
P50 latency               Request timing  > 10s    > 30s
P99 latency               Request timing  > 60s    > 120s
Cache hit rate            Redis stats     < 25%    < 10%
Provider error rate       Circuit state   > 5%     > 20%
Circuit breaker open      Circuit state   Any      Multiple
Free tier rejection rate  Load shedding   > 20%    > 50%

Alerting

# PagerDuty / Slack alerts
ALERTS = {
    "queue_critical": {
        "condition": lambda h: h.size > 2000,
        "severity": "critical",
        "message": "LLM queue depth critical: {size} requests backed up"
    },
    "provider_down": {
        "condition": lambda p: not p.healthy,
        "severity": "warning",
        "message": "Provider {name} circuit breaker open"
    },
    "all_providers_down": {
        "condition": lambda: all(not s.healthy for s in CIRCUIT_STATES.values()),
        "severity": "critical",
        "message": "ALL LLM providers are down!"
    },
}
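
A minimal evaluation loop that checks these conditions on an interval, reusing get_queue_health and the alert_ops hook from the circuit breaker; the 15-second interval is an assumption:

async def alert_loop(interval: float = 15.0):
    """Periodically evaluate alert conditions and page when they fire."""
    while True:
        health = await get_queue_health()
        if health.size > 2000:
            await alert_ops(f"LLM queue depth critical: {health.size} requests backed up")

        unhealthy = [name for name, s in CIRCUIT_STATES.items() if not s.healthy]
        if unhealthy and len(unhealthy) == len(CIRCUIT_STATES):
            await alert_ops("ALL LLM providers are down!")
        elif unhealthy:
            await alert_ops(f"Provider circuit breaker open: {', '.join(unhealthy)}")

        await asyncio.sleep(interval)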

Dashboard Queries

-- Requests per minute by tier
SELECT
    date_trunc('minute', created_at) as minute,
    tier,
    COUNT(*) as requests
FROM llm_requests
WHERE created_at > NOW() - INTERVAL '1 hour'
GROUP BY 1, 2
ORDER BY 1 DESC;

-- Error rate by provider
SELECT
    provider,
    COUNT(*) FILTER (WHERE status = 'error') * 100.0 / COUNT(*) as error_rate
FROM llm_requests
WHERE created_at > NOW() - INTERVAL '1 hour'
GROUP BY provider;

-- BYOK adoption
SELECT
    tier,
    COUNT(*) FILTER (WHERE used_user_key) * 100.0 / COUNT(*) as byok_percentage
FROM llm_requests
WHERE created_at > NOW() - INTERVAL '24 hours'
GROUP BY tier;
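
These queries read from an llm_requests log table that is not defined elsewhere in this document. A plausible shape, inferred from the columns referenced above:

CREATE TABLE llm_requests (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id UUID REFERENCES users(id),
    tier VARCHAR(20) NOT NULL,
    provider VARCHAR(50) NOT NULL,
    model VARCHAR(100) NOT NULL,
    status VARCHAR(20) NOT NULL,          -- 'success', 'error', 'timeout'
    used_user_key BOOLEAN DEFAULT false,  -- true for BYOK requests
    latency_ms INTEGER,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX idx_llm_requests_created ON llm_requests(created_at);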

Viral Day Playbook

What to do when that tweet hits:

Hour 0-1: Detection

  • Alert: Queue depth > 500
  • Action: Monitor, no intervention needed

Hour 1-2: Escalation

  • Alert: Queue depth > 1000, latency spiking
  • Action:
    • Verify all provider circuits are healthy
    • Check cache hit rate (should be climbing)
    • Prepare to enable aggressive load shedding

Hour 2-4: Peak

  • Alert: Queue depth > 2000, free tier rejections > 30%
  • Action:
    • Enable aggressive load shedding for free tier
    • Send "high demand" email to free users with upgrade CTA
    • Monitor Pro/Enterprise latency (must stay < 30s)
    • Tweet acknowledgment: "We're experiencing high demand due to [reason]. Pro users unaffected."

Hour 4-8: Stabilization

  • Queue draining as cache warms and load shedding works
  • Many users convert to Pro or add BYOK keys
  • Circuits recovering as providers stabilize

Post-Mortem

  • Review metrics: peak queue, rejection rate, conversion rate
  • Adjust tier limits if needed
  • Consider adding provider capacity for sustained growth
