BloxServer LLM Abstraction Layer — Resilient Multi-Provider Architecture
Status: Design
Date: January 2026
Overview
The LLM abstraction layer is the critical path for all AI operations in BloxServer. It must handle:
- Viral growth: 100 → 10,000 users overnight
- Provider outages: Single provider down ≠ platform down
- Fair access: Paid users prioritized, free users served fairly
- Cost control: Platform keys vs BYOK (Bring Your Own Key)
- Low latency: Sub-second for simple calls, reasonable for complex
This document specifies the defense-in-depth architecture that survives success.
Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│                      LLM Abstraction Layer                      │
│                                                                 │
│  Request → [Rate Limit] → [Cache Check] → [Queue] → [Dispatch]  │
│                │              │             │           │       │
│                ▼              ▼             ▼           ▼       │
│            Per-user        Semantic      Priority    Provider   │
│            per-tier         cache         queues      pool +    │
│             limits       (30%+ hits)    (by tier)    failover   │
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                 BYOK (Bring Your Own Key)                 │  │
│  │    Pro+ users with own API keys bypass platform limits    │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                    High Frequency Tier                    │  │
│  │      Dedicated capacity, custom SLA — contact sales       │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
```
Tier Limits
| Tier | Price | Requests/min | Tokens/min | Concurrent | Latency SLA |
|---|---|---|---|---|---|
| Free | $0 | 10 | 10,000 | 2 | Best effort |
| Pro | $29/mo | 60 | 100,000 | 10 | < 30s P95 |
| Enterprise | Custom | 300 | 500,000 | 50 | < 10s P95 |
| High Frequency | Custom | Custom | Custom | Dedicated | Custom SLA |
| BYOK (any tier) | — | Unlimited* | Unlimited* | 20 | User's provider |
*BYOK users are limited only by their own provider's rate limits.
High Frequency Tier
For users requiring:
- Low latency: Sub-second response times
- High throughput: Thousands of requests per minute
- Guaranteed capacity: Dedicated provider allocations
- Custom models: Fine-tuned or private deployments
Use cases:
- Real-time trading signals
- Live customer support at scale
- High-volume content generation
- Latency-sensitive applications
Pricing: Custom — based on capacity reservation, SLA requirements, and volume.
Landing page CTA:
```
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│                    Need High Frequency?                     │
│                                                             │
│   Building something that needs thousands of requests per   │
│   minute with sub-second latency? Let's talk dedicated      │
│   capacity and custom SLAs.                                 │
│                                                             │
│                      [Contact Sales →]                      │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```
Layer 1: Intake Rate Limiting
First line of defense. Rejects requests before they consume resources.
Implementation
```python
from dataclasses import dataclass
from enum import Enum
import time

class Tier(Enum):
    FREE = "free"
    PRO = "pro"
    ENTERPRISE = "enterprise"
    HIGH_FREQUENCY = "high_frequency"

@dataclass
class TierLimits:
    requests_per_minute: int
    tokens_per_minute: int
    max_concurrent: int

TIER_LIMITS = {
    Tier.FREE: TierLimits(10, 10_000, 2),
    Tier.PRO: TierLimits(60, 100_000, 10),
    Tier.ENTERPRISE: TierLimits(300, 500_000, 50),
    Tier.HIGH_FREQUENCY: TierLimits(10_000, 10_000_000, 500),  # Custom per customer
}

@dataclass
class RateLimitResult:
    allowed: bool
    use_user_key: bool = False
    retry_after: int | None = None
    reason: str | None = None
    concurrent_key: str | None = None

async def rate_limit_check(user: User, request: LLMRequest) -> RateLimitResult:
    """Check if user can make this request."""
    # BYOK users bypass platform limits
    if user.has_own_api_key(request.provider):
        return RateLimitResult(allowed=True, use_user_key=True)

    limits = TIER_LIMITS[user.tier]

    # Check requests per minute (sliding window over a sorted set)
    rpm_key = f"ratelimit:{user.id}:rpm"
    now = time.time()
    window_start = now - 60

    # Remove old entries, add this request, count, refresh TTL
    pipe = redis.pipeline()
    pipe.zremrangebyscore(rpm_key, 0, window_start)
    pipe.zadd(rpm_key, {str(now): now})
    pipe.zcard(rpm_key)
    pipe.expire(rpm_key, 120)
    _, _, current_rpm, _ = await pipe.execute()

    if current_rpm > limits.requests_per_minute:
        # Don't count the rejected request against the window
        await redis.zrem(rpm_key, str(now))
        # Retry when the oldest entry in the window ages out
        oldest = await redis.zrange(rpm_key, 0, 0, withscores=True)
        retry_after = int(oldest[0][1] + 60 - now) + 1 if oldest else 1
        return RateLimitResult(
            allowed=False,
            retry_after=retry_after,
            reason=f"Rate limit: {limits.requests_per_minute} requests/minute",
        )

    # Check concurrent requests
    concurrent_key = f"ratelimit:{user.id}:concurrent"
    current_concurrent = await redis.incr(concurrent_key)
    await redis.expire(concurrent_key, 300)  # 5 min TTL as safety net

    if current_concurrent > limits.max_concurrent:
        await redis.decr(concurrent_key)
        return RateLimitResult(
            allowed=False,
            retry_after=1,
            reason=f"Max concurrent: {limits.max_concurrent} requests",
        )

    return RateLimitResult(allowed=True, concurrent_key=concurrent_key)

async def release_concurrent(concurrent_key: str):
    """Release concurrent slot after request completes."""
    if concurrent_key:
        await redis.decr(concurrent_key)
```
Rate Limit Headers
Return standard headers so clients can self-regulate:
```python
async def rate_limit_headers(user: User) -> dict:
    limits = TIER_LIMITS[user.tier]
    current = await get_current_usage(user.id)
    return {
        "X-RateLimit-Limit": str(limits.requests_per_minute),
        "X-RateLimit-Remaining": str(max(0, limits.requests_per_minute - current.rpm)),
        "X-RateLimit-Reset": str(int(time.time()) + 60),
    }
```
Layer 2: Semantic Cache
Identical requests return cached responses. Reduces load and cost.
Cache Key Generation
```python
import hashlib
import json

def hash_request(request: LLMRequest) -> str:
    """Generate deterministic cache key for request."""
    # Include all parameters that affect output
    cache_input = {
        "model": request.model,
        "messages": [
            {"role": m.role, "content": m.content}
            for m in request.messages
        ],
        "temperature": request.temperature,
        "max_tokens": request.max_tokens,
        "tools": request.tools,  # Tool definitions matter
        # Exclude: user_id, timestamps, request_id
    }
    serialized = json.dumps(cache_input, sort_keys=True)
    return hashlib.sha256(serialized.encode()).hexdigest()[:32]
```
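The key property is that the hash is invariant to dict key order, which `sort_keys=True` guarantees, while any change to a parameter that affects output produces a new key. A small self-contained check of the same scheme (`hash_payload` here is an illustrative stand-in for `hash_request`):

```python
import hashlib
import json

def hash_payload(payload: dict) -> str:
    # Same scheme as hash_request: canonical JSON, truncated SHA-256
    serialized = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(serialized.encode()).hexdigest()[:32]

a = {"model": "gpt-4o", "temperature": 0,
     "messages": [{"role": "user", "content": "hi"}]}
b = {"temperature": 0, "model": "gpt-4o",
     "messages": [{"role": "user", "content": "hi"}]}

print(hash_payload(a) == hash_payload(b))  # True: key order doesn't matter
print(hash_payload(a) == hash_payload({**a, "temperature": 0.5}))  # False
```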
Cache Logic
```python
@dataclass
class CachedResponse:
    response: LLMResponse
    cached_at: float
    hit_count: int

async def check_semantic_cache(request: LLMRequest) -> LLMResponse | None:
    """Check if we've seen this exact request before."""
    cache_key = f"llmcache:{hash_request(request)}"
    cached = await redis.get(cache_key)
    if cached:
        data = json.loads(cached)
        # Update hit count for analytics
        await redis.hincrby("llmcache:stats", "hits", 1)
        return LLMResponse(
            content=data["content"],
            model=data["model"],
            usage=data["usage"],
            cached=True,
        )
    await redis.hincrby("llmcache:stats", "misses", 1)
    return None

async def cache_response(request: LLMRequest, response: LLMResponse):
    """Cache response with TTL based on determinism."""
    # Don't cache errors or empty responses
    if response.error or not response.content:
        return
    cache_key = f"llmcache:{hash_request(request)}"
    # TTL based on temperature (determinism)
    if request.temperature == 0:
        ttl = 86400  # 24 hours for deterministic
    elif request.temperature < 0.3:
        ttl = 3600  # 1 hour
    elif request.temperature < 0.7:
        ttl = 300  # 5 minutes
    else:
        return  # Don't cache high-temperature responses
    cache_data = {
        "content": response.content,
        "model": response.model,
        "usage": response.usage,
        "cached_at": time.time(),
    }
    await redis.setex(cache_key, ttl, json.dumps(cache_data))
```
Expected Cache Performance
| Use Case | Temperature | Expected Hit Rate |
|---|---|---|
| Tool calls (same inputs) | 0 | 70-90% |
| Structured extraction | 0-0.3 | 50-70% |
| Agent reasoning | 0.5-0.7 | 20-40% |
| Creative content | 0.8-1.0 | ~0% |
Aggregate impact: 30-40% reduction in API calls for typical workloads.
Layer 3: Priority Queues
Paid users get priority. Free users are served fairly but can be shed under load.
Queue Structure
```python
import json
import time
import uuid
from dataclasses import dataclass, asdict

# Redis sorted set with composite score:
#   score = priority * PRIORITY_MULTIPLIER + timestamp
# Lower score = higher priority, earlier arrival. The multiplier must
# exceed any Unix timestamp (~1.7e9 seconds) so priority always dominates.
PRIORITY_MULTIPLIER = 10_000_000_000

QUEUE_PRIORITIES = {
    Tier.HIGH_FREQUENCY: 0,  # Highest priority (dedicated customers)
    Tier.ENTERPRISE: 1,
    Tier.PRO: 2,
    "trial": 2,              # Trials get Pro priority (first impression)
    Tier.FREE: 3,            # Lowest priority
}

@dataclass
class QueuedRequest:
    ticket_id: str
    user_id: str
    tier: str
    request: LLMRequest
    enqueued_at: float
    use_user_key: bool = False

async def enqueue_request(user: User, request: LLMRequest, use_user_key: bool) -> str:
    """Add request to priority queue, return ticket ID."""
    ticket_id = f"ticket:{uuid.uuid4().hex}"
    priority = QUEUE_PRIORITIES.get(user.tier, 3)
    score = priority * PRIORITY_MULTIPLIER + time.time()
    queued = QueuedRequest(
        ticket_id=ticket_id,
        user_id=str(user.id),
        # Store the plain string so the payload is JSON-serializable
        tier=user.tier.value if isinstance(user.tier, Tier) else user.tier,
        request=request,
        enqueued_at=time.time(),
        use_user_key=use_user_key,
    )
    await redis.zadd("llm:queue", {json.dumps(asdict(queued)): score})
    # Set a result placeholder
    await redis.setex(f"llm:result:{ticket_id}", 300, "pending")
    return ticket_id
```
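The composite-score trick can be verified in isolation. The sketch below uses a 1e10 multiplier because Unix timestamps (currently around 1.7e9 seconds) would overflow a 1e9 priority band; with the larger multiplier, a later-arriving paid request still sorts ahead of an earlier free one, and FIFO order holds within a tier:

```python
import time

PRIORITY_MULTIPLIER = 10_000_000_000  # must exceed any Unix timestamp (~1.7e9)

def score(priority: int, enqueued_at: float) -> float:
    return priority * PRIORITY_MULTIPLIER + enqueued_at

now = time.time()
items = [
    ("free_early", score(3, now)),      # free user, arrived first
    ("pro_late", score(2, now + 30)),   # pro user, arrived 30s later
    ("ent", score(1, now + 60)),        # enterprise, arrived last
]
items.sort(key=lambda kv: kv[1])  # ZPOPMIN order: lowest score first
print([name for name, _ in items])  # ['ent', 'pro_late', 'free_early']
```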
Queue Workers
```python
import asyncio

async def queue_worker():
    """Process requests from the queue."""
    while True:
        # Get highest priority item (lowest score)
        items = await redis.zpopmin("llm:queue", count=1)
        if not items:
            await asyncio.sleep(0.1)  # Brief pause if queue empty
            continue
        data, score = items[0]
        queued = QueuedRequest(**json.loads(data))
        try:
            # Select provider and execute
            response = await execute_llm_request(queued)
            # Store result
            await redis.setex(
                f"llm:result:{queued.ticket_id}",
                300,
                json.dumps({"status": "success", "response": asdict(response)}),
            )
        except Exception as e:
            await redis.setex(
                f"llm:result:{queued.ticket_id}",
                300,
                json.dumps({"status": "error", "error": str(e)}),
            )

async def wait_for_result(ticket_id: str, timeout: float = 120) -> LLMResponse:
    """Wait for queued request to complete."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        result = await redis.get(f"llm:result:{ticket_id}")
        if result and result != "pending":
            data = json.loads(result)
            if data["status"] == "success":
                return LLMResponse(**data["response"])
            raise LLMError(data["error"])
        await asyncio.sleep(0.1)
    raise RequestTimeout("Request timed out")
```
Queue Health Monitoring
```python
@dataclass
class QueueHealth:
    size: int
    oldest_wait_seconds: float
    by_tier: dict[str, int]
    status: str  # healthy, degraded, critical

async def get_queue_health() -> QueueHealth:
    """Get queue metrics for monitoring and load shedding."""
    queue_size = await redis.zcard("llm:queue")

    # Get the head-of-queue item; read its enqueued_at field rather than
    # trying to decode the timestamp out of the composite score
    oldest = await redis.zrange("llm:queue", 0, 0)
    if oldest:
        wait_time = time.time() - json.loads(oldest[0])["enqueued_at"]
    else:
        wait_time = 0

    # Count by tier
    all_items = await redis.zrange("llm:queue", 0, -1)
    by_tier: dict[str, int] = {}
    for item in all_items:
        data = json.loads(item)
        tier = data.get("tier", "unknown")
        by_tier[tier] = by_tier.get(tier, 0) + 1

    # Determine status
    if queue_size < 500:
        status = "healthy"
    elif queue_size < 2000:
        status = "degraded"
    else:
        status = "critical"

    return QueueHealth(
        size=queue_size,
        oldest_wait_seconds=wait_time,
        by_tier=by_tier,
        status=status,
    )
```
Layer 4: Multi-Provider Pool with Circuit Breakers
Never depend on a single provider.
Provider Configuration
```python
@dataclass
class ProviderConfig:
    name: str
    base_url: str
    api_key_env: str
    models: list[str]
    max_concurrent: int
    priority: int  # Lower = preferred
    timeout: float = 60.0

PROVIDERS = {
    "anthropic": ProviderConfig(
        name="anthropic",
        base_url="https://api.anthropic.com/v1",
        api_key_env="ANTHROPIC_API_KEY",
        models=["claude-sonnet-4-20250514", "claude-opus-4-20250514", "claude-haiku-3"],
        max_concurrent=100,
        priority=1,
    ),
    "openai": ProviderConfig(
        name="openai",
        base_url="https://api.openai.com/v1",
        api_key_env="OPENAI_API_KEY",
        models=["gpt-4o", "gpt-4o-mini", "o1", "o3-mini"],
        max_concurrent=50,
        priority=2,
    ),
    "xai": ProviderConfig(
        name="xai",
        base_url="https://api.x.ai/v1",
        api_key_env="XAI_API_KEY",
        models=["grok-3", "grok-3-mini"],
        max_concurrent=50,
        priority=1,
    ),
    "together": ProviderConfig(
        name="together",
        base_url="https://api.together.xyz/v1",
        api_key_env="TOGETHER_API_KEY",
        models=["llama-3-70b", "mixtral-8x7b"],
        max_concurrent=100,
        priority=3,  # Fallback
    ),
}
```
Circuit Breaker State
```python
@dataclass
class CircuitState:
    provider: str
    healthy: bool = True
    failures: int = 0
    successes: int = 0
    last_failure: float = 0
    circuit_open_until: float = 0
    current_load: int = 0

# In-memory state (move to Redis for a distributed deployment)
CIRCUIT_STATES: dict[str, CircuitState] = {
    name: CircuitState(provider=name) for name in PROVIDERS
}

CIRCUIT_CONFIG = {
    "failure_threshold": 5,   # Failures before opening
    "success_threshold": 3,   # Successes before closing
    "open_duration": 30,      # Seconds circuit stays open
    "half_open_requests": 1,  # Probe allowance in half-open state (not enforced in this sketch)
}

async def record_success(provider: str):
    """Record successful request."""
    state = CIRCUIT_STATES[provider]
    state.successes += 1
    state.failures = 0
    if not state.healthy and state.successes >= CIRCUIT_CONFIG["success_threshold"]:
        state.healthy = True
        logger.info(f"Circuit closed for {provider}")

async def record_failure(provider: str, error: Exception):
    """Record failed request, potentially open circuit."""
    state = CIRCUIT_STATES[provider]
    state.failures += 1
    state.successes = 0
    state.last_failure = time.time()
    if state.failures >= CIRCUIT_CONFIG["failure_threshold"]:
        state.healthy = False
        state.circuit_open_until = time.time() + CIRCUIT_CONFIG["open_duration"]
        logger.error(f"Circuit opened for {provider}: {error}")
        await alert_ops(f"LLM provider {provider} circuit opened")

def is_provider_available(provider: str) -> bool:
    """Check if provider can accept requests."""
    state = CIRCUIT_STATES[provider]
    config = PROVIDERS[provider]
    # Circuit open?
    if not state.healthy:
        if time.time() < state.circuit_open_until:
            return False
        # Open window elapsed: fall through and let traffic probe the
        # provider (half-open). record_success closes the circuit after
        # success_threshold consecutive successes.
    # At capacity?
    if state.current_load >= config.max_concurrent:
        return False
    return True
```
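The open → half-open → closed lifecycle is easiest to see with the state collapsed into one small class. This is a single-provider sketch of the same logic (not the code above verbatim), with time passed in explicitly so the transitions are deterministic:

```python
class Circuit:
    """Minimal circuit breaker matching CIRCUIT_CONFIG's thresholds."""

    def __init__(self, failure_threshold=5, success_threshold=3, open_duration=30):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.open_duration = open_duration
        self.healthy = True
        self.failures = 0
        self.successes = 0
        self.open_until = 0.0

    def record_failure(self, now: float):
        self.failures += 1
        self.successes = 0
        if self.failures >= self.failure_threshold:
            self.healthy = False
            self.open_until = now + self.open_duration

    def record_success(self):
        self.successes += 1
        self.failures = 0
        if not self.healthy and self.successes >= self.success_threshold:
            self.healthy = True  # circuit closes

    def available(self, now: float) -> bool:
        # Open circuit blocks traffic until open_until; afterwards probes flow
        return self.healthy or now >= self.open_until

c = Circuit()
for _ in range(5):
    c.record_failure(now=0.0)
print(c.available(now=10.0))  # False: open for 30s
print(c.available(now=31.0))  # True: half-open probe allowed
for _ in range(3):
    c.record_success()
print(c.healthy)              # True: circuit closed again
```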
Provider Selection
```python
import os

def get_providers_for_model(model: str) -> list[str]:
    """Get providers that support this model."""
    return [
        name for name, config in PROVIDERS.items()
        if model in config.models
        or any(model.startswith(m.split("-")[0]) for m in config.models)
    ]

async def select_provider(request: LLMRequest, user_key: str | None = None) -> tuple[str, str]:
    """Select best available provider, return (provider_name, api_key)."""
    candidates = get_providers_for_model(request.model)
    if not candidates:
        raise UnsupportedModel(f"No provider supports model: {request.model}")

    # Filter to available providers
    available = [p for p in candidates if is_provider_available(p)]
    if not available:
        raise NoProvidersAvailable(
            "All providers for this model are currently unavailable. "
            "Please try again in a few seconds."
        )

    # Sort by priority, then by current load ratio
    available.sort(key=lambda p: (
        PROVIDERS[p].priority,
        CIRCUIT_STATES[p].current_load / PROVIDERS[p].max_concurrent,
    ))
    selected = available[0]

    # Determine API key: user's own key wins, else platform key
    api_key = user_key or os.environ[PROVIDERS[selected].api_key_env]
    return selected, api_key
```
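The two-level sort key deserves a concrete look: priority decides first, and among equal-priority providers the lower load ratio wins, which spreads traffic across healthy peers. A standalone sketch with illustrative numbers (not the real provider state):

```python
providers = {
    "anthropic": {"priority": 1, "load": 80, "max": 100},  # 80% utilized
    "xai":       {"priority": 1, "load": 10, "max": 50},   # 20% utilized
    "openai":    {"priority": 2, "load": 0,  "max": 50},   # idle but lower priority
}

ranked = sorted(
    providers,
    key=lambda p: (providers[p]["priority"],
                   providers[p]["load"] / providers[p]["max"]),
)
print(ranked)  # ['xai', 'anthropic', 'openai']: ties broken by utilization
```

Note that an idle priority-2 provider still ranks below a busy priority-1 one; if that becomes a latency problem in practice, the key can blend priority and load instead of nesting them.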
Layer 5: BYOK (Bring Your Own Key)
Pro+ users can add their own API keys to bypass platform limits.
Database Schema
```sql
CREATE TABLE user_api_keys (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id UUID REFERENCES users(id) ON DELETE CASCADE,
    provider VARCHAR(50) NOT NULL,
    encrypted_key BYTEA NOT NULL,
    key_hint VARCHAR(20),  -- Last 6 chars for display: "...abc123"
    is_valid BOOLEAN DEFAULT true,
    last_used_at TIMESTAMPTZ,
    last_error VARCHAR(255),
    created_at TIMESTAMPTZ DEFAULT NOW(),
    UNIQUE(user_id, provider)
);

CREATE INDEX idx_user_api_keys_user ON user_api_keys(user_id);
```
Key Encryption
```python
from cryptography.fernet import Fernet

# Platform encryption key (from environment, rotated periodically)
ENCRYPTION_KEY = Fernet(os.environ["API_KEY_ENCRYPTION_KEY"])

def encrypt_api_key(key: str) -> bytes:
    """Encrypt user's API key for storage."""
    return ENCRYPTION_KEY.encrypt(key.encode())

def decrypt_api_key(encrypted: bytes) -> str:
    """Decrypt user's API key for use."""
    return ENCRYPTION_KEY.decrypt(encrypted).decode()

async def store_user_api_key(user_id: str, provider: str, api_key: str):
    """Store encrypted API key for user."""
    # Validate key format
    if not validate_key_format(provider, api_key):
        raise InvalidAPIKey(f"Invalid {provider} API key format")
    # Test the key with a live call before accepting it
    if not await test_api_key(provider, api_key):
        raise InvalidAPIKey(f"API key validation failed for {provider}")

    encrypted = encrypt_api_key(api_key)
    key_hint = f"...{api_key[-6:]}"
    await db.execute("""
        INSERT INTO user_api_keys (user_id, provider, encrypted_key, key_hint)
        VALUES ($1, $2, $3, $4)
        ON CONFLICT (user_id, provider)
        DO UPDATE SET encrypted_key = $3, key_hint = $4, is_valid = true, last_error = NULL
    """, user_id, provider, encrypted, key_hint)

async def get_user_api_key(user_id: str, provider: str) -> str | None:
    """Get decrypted API key for user, if they have one."""
    row = await db.fetchrow("""
        SELECT encrypted_key, is_valid
        FROM user_api_keys
        WHERE user_id = $1 AND provider = $2
    """, user_id, provider)
    if not row or not row["is_valid"]:
        return None
    return decrypt_api_key(row["encrypted_key"])
```
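The `validate_key_format` helper referenced above is not defined in this document; a minimal regex-based sketch might look like the following. The patterns are illustrative assumptions (real provider key formats vary and change), so treat them as a cheap sanity check in front of the live `test_api_key` probe, not as authoritative validation:

```python
import re

# Hypothetical per-provider key shapes; prefixes are assumptions, not spec
KEY_PATTERNS = {
    "anthropic": re.compile(r"^sk-ant-[A-Za-z0-9_-]{20,}$"),
    "openai": re.compile(r"^sk-[A-Za-z0-9_-]{20,}$"),
}

def validate_key_format(provider: str, api_key: str) -> bool:
    """Cheap structural check before making a live validation call."""
    pattern = KEY_PATTERNS.get(provider)
    return bool(pattern and pattern.match(api_key))

print(validate_key_format("anthropic", "sk-ant-" + "x" * 24))  # True
print(validate_key_format("openai", "not-a-key"))              # False
```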
BYOK Request Flow
```python
async def execute_with_byok(user: User, request: LLMRequest) -> LLMResponse:
    """Execute request, preferring user's own key if available."""
    provider = get_provider_for_model(request.model)
    user_key = await get_user_api_key(user.id, provider)

    if user_key:
        # Use user's key, bypassing platform rate limits
        try:
            response = await call_provider_direct(request, user_key)
            # Update last used
            await db.execute("""
                UPDATE user_api_keys
                SET last_used_at = NOW(), last_error = NULL
                WHERE user_id = $1 AND provider = $2
            """, user.id, provider)
            return response
        except AuthenticationError:
            # Key is invalid: mark it and fall back to platform
            await db.execute("""
                UPDATE user_api_keys
                SET is_valid = false, last_error = 'Authentication failed'
                WHERE user_id = $1 AND provider = $2
            """, user.id, provider)
            # Notify user
            await send_notification(user, "api_key_invalid", {"provider": provider})
            # Fall through to platform key

    # Use platform key (with rate limiting)
    return await execute_with_platform_key(user, request)
```
Layer 6: Backpressure & Graceful Degradation
When overwhelmed, fail gracefully and prioritize paid users.
Load Shedding
```python
import random

async def should_shed_load(user: User, queue_health: QueueHealth) -> bool:
    """Determine if this request should be rejected to protect the system."""
    # High Frequency and Enterprise never shed
    if user.tier in (Tier.HIGH_FREQUENCY, Tier.ENTERPRISE):
        return False

    # Probabilistic shedding ramps with queue size, capped at 90%
    shed_probability = min(0.9, max(0.0, (queue_health.size - 500) / 2000))

    # Pro sheds only in critical, and at a reduced rate (tune as needed)
    if user.tier == Tier.PRO:
        return (queue_health.status == "critical"
                and random.random() < shed_probability / 3)

    # Free tier sheds in degraded or critical
    if user.tier == Tier.FREE and queue_health.status in ("degraded", "critical"):
        return random.random() < shed_probability

    return False
```
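The shedding ramp is worth seeing with numbers: shedding starts at the degraded threshold (queue size 500), reaches 50% at 1,500 queued requests, and caps at 90% so some free traffic always gets through. A pure-function sketch of the same formula, with a `max(0, ...)` guard added for sizes below the threshold:

```python
def shed_probability(queue_size: int) -> float:
    """Probability a free-tier request is shed, per should_shed_load."""
    return min(0.9, max(0.0, (queue_size - 500) / 2000))

print(shed_probability(500))   # 0.0: shedding starts at the degraded threshold
print(shed_probability(1500))  # 0.5
print(shed_probability(5000))  # 0.9: capped, never shed 100%
```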
Graceful Error Messages
```python
class ServiceDegraded(Exception):
    """Raised when load shedding rejects a request."""

    def __init__(self, tier: Tier, queue_health: QueueHealth):
        if tier == Tier.FREE:
            message = (
                "We're experiencing high demand. Free tier requests are "
                "temporarily paused. Upgrade to Pro for priority access, "
                "or try again in a few minutes."
            )
            retry_after = 60
        else:
            message = (
                "High demand is causing delays. Your request has been queued. "
                "Expected wait time: ~{} seconds."
            ).format(int(queue_health.oldest_wait_seconds * 1.5))
            retry_after = 30
        self.message = message
        self.retry_after = retry_after
        super().__init__(message)
```
Timeout Handling
```python
async def execute_with_timeout(request: LLMRequest, provider: str, api_key: str) -> LLMResponse:
    """Execute request with appropriate timeout."""
    # Timeout based on expected response size
    if request.max_tokens and request.max_tokens > 2000:
        timeout = 120  # Long responses need more time
    else:
        timeout = 60
    try:
        async with asyncio.timeout(timeout):
            return await call_provider(request, provider, api_key)
    except asyncio.TimeoutError:
        await record_failure(provider, TimeoutError("Request timed out"))
        raise RequestTimeout(
            f"Request timed out after {timeout}s. "
            "Try reducing max_tokens or simplifying the prompt."
        )
```
Main Entry Point
```python
async def handle_llm_request(user: User, request: LLMRequest) -> LLMResponse:
    """
    Main entry point for all LLM requests.
    Implements the full defense-in-depth stack.
    """
    concurrent_key = None
    try:
        # Layer 1: Rate limiting
        rate_result = await rate_limit_check(user, request)
        if not rate_result.allowed:
            raise RateLimitExceeded(
                message=rate_result.reason,
                retry_after=rate_result.retry_after,
            )
        concurrent_key = rate_result.concurrent_key

        # Layer 2: Semantic cache
        cached = await check_semantic_cache(request)
        if cached:
            return cached

        # Layer 6: Check queue health for load shedding
        queue_health = await get_queue_health()
        if await should_shed_load(user, queue_health):
            raise ServiceDegraded(user.tier, queue_health)

        # Layer 3: Enqueue with priority; Layer 4 workers dispatch to providers
        ticket_id = await enqueue_request(user, request, rate_result.use_user_key)
        response = await wait_for_result(ticket_id, timeout=120)

        # Populate the cache for future identical requests
        await cache_response(request, response)
        return response
    finally:
        # Always release the concurrent slot
        if concurrent_key:
            await release_concurrent(concurrent_key)
```
Monitoring & Alerts
Key Metrics
| Metric | Source | Warning | Critical |
|---|---|---|---|
| Queue depth | Redis ZCARD | > 500 | > 2000 |
| P50 latency | Request timing | > 10s | > 30s |
| P99 latency | Request timing | > 60s | > 120s |
| Cache hit rate | Redis stats | < 25% | < 10% |
| Provider error rate | Circuit state | > 5% | > 20% |
| Circuit breaker open | Circuit state | Any | Multiple |
| Free tier rejection rate | Load shedding | > 20% | > 50% |
Alerting
```python
# PagerDuty / Slack alerts
ALERTS = {
    "queue_critical": {
        "condition": lambda h: h.size > 2000,
        "severity": "critical",
        "message": "LLM queue depth critical: {size} requests backed up",
    },
    "provider_down": {
        "condition": lambda p: not p.healthy,
        "severity": "warning",
        "message": "Provider {name} circuit breaker open",
    },
    "all_providers_down": {
        "condition": lambda: all(not s.healthy for s in CIRCUIT_STATES.values()),
        "severity": "critical",
        "message": "ALL LLM providers are down!",
    },
}
```
Dashboard Queries
```sql
-- Requests per minute by tier
SELECT
    date_trunc('minute', created_at) AS minute,
    tier,
    COUNT(*) AS requests
FROM llm_requests
WHERE created_at > NOW() - INTERVAL '1 hour'
GROUP BY 1, 2
ORDER BY 1 DESC;

-- Error rate by provider
SELECT
    provider,
    COUNT(*) FILTER (WHERE status = 'error') * 100.0 / COUNT(*) AS error_rate
FROM llm_requests
WHERE created_at > NOW() - INTERVAL '1 hour'
GROUP BY provider;

-- BYOK adoption
SELECT
    tier,
    COUNT(*) FILTER (WHERE used_user_key) * 100.0 / COUNT(*) AS byok_percentage
FROM llm_requests
WHERE created_at > NOW() - INTERVAL '24 hours'
GROUP BY tier;
```
Viral Day Playbook
What to do when that tweet hits:
Hour 0-1: Detection
- Alert: Queue depth > 500
- Action: Monitor, no intervention needed
Hour 1-2: Escalation
- Alert: Queue depth > 1000, latency spiking
- Action:
- Verify all provider circuits are healthy
- Check cache hit rate (should be climbing)
- Prepare to enable aggressive load shedding
Hour 2-4: Peak
- Alert: Queue depth > 2000, free tier rejections > 30%
- Action:
- Enable aggressive load shedding for free tier
- Send "high demand" email to free users with upgrade CTA
- Monitor Pro/Enterprise latency (must stay < 30s)
- Tweet acknowledgment: "We're experiencing high demand due to [reason]. Pro users unaffected."
Hour 4-8: Stabilization
- Queue draining as cache warms and load shedding works
- Many users convert to Pro or add BYOK keys
- Circuits recovering as providers stabilize
Post-Mortem
- Review metrics: peak queue, rejection rate, conversion rate
- Adjust tier limits if needed
- Consider adding provider capacity for sustained growth
References
- Stripe-style rate limiting
- Circuit breaker pattern
- Token bucket algorithm
- BloxServer Billing — Tier definitions and pricing