xml-pipeline/docs/archive-obsolete/token-scheduling-issues.md
2026-01-07 13:12:08 -08:00

# Proposed Scheduling Enhancements for Token-Constrained Environments
**January 05, 2026**
These ideas (originally surfaced by Gemini) introduce token-aware and fairness-oriented scheduling on top of the current AgentServer v2.0 message pump. The current pump already uses per-thread queues with configurable breadth-first or depth-first draining, but these suggestions shift focus to **per-agent** fairness and explicit handling of shared LLM API rate limits (TPM/RPM).
They are presented here cleaned up and structured for easier discussion and potential adoption.
### Per-Agent Buffer Pools (Targeted Fairness)
**Concept**
Instead of a single global queue or purely thread-based queues, each registered agent (especially LLM-based listeners) gets its own dedicated message buffer.
**Benefit**
- Guarantees each agent a round-robin share of "system attention".
- Prevents a high-volume agent (e.g., a central researcher or coordinator) from starving others.
- Caps the number of active queues to the number of agents rather than potentially unbounded threads.
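A minimal sketch of such a pool, in Python; the names (`AgentBufferPool`, `enqueue`, `next_message`) are invented for illustration and are not part of the existing AgentServer v2.0 API:

```python
from collections import deque


class AgentBufferPool:
    """Per-agent message buffers drained in round-robin order (sketch)."""

    def __init__(self):
        self._buffers: dict[str, deque] = {}   # agent_id -> its own buffer
        self._rr: deque = deque()              # round-robin order of agent ids

    def enqueue(self, agent_id: str, message) -> None:
        if agent_id not in self._buffers:
            self._buffers[agent_id] = deque()
            self._rr.append(agent_id)
        self._buffers[agent_id].append(message)

    def next_message(self):
        """Return (agent_id, message) from the next non-empty buffer, or None."""
        for _ in range(len(self._rr)):
            agent_id = self._rr[0]
            self._rr.rotate(-1)  # move agent to the back regardless of outcome
            buf = self._buffers[agent_id]
            if buf:
                return agent_id, buf.popleft()
        return None
```

Because the rotation happens on every visit, a high-volume agent with a deep buffer still yields the floor to every other agent once per cycle.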
### Token-Aware Weighted Deficit Round Robin (TA-WDRR)
**Concept**
Treat the provider's Tokens Per Minute (TPM) limit as a shared "power budget". Each agent maintains a deficit counter that accrues a fixed quantum of tokens each scheduling round.
**Logic**
1. The pump looks at the next message in an agent's buffer.
2. It estimates the token cost of that message (prompt + max_tokens).
3. If the cost exceeds the agent's current deficit **or** the remaining global budget, skip that agent and try another.
4. Select a smaller job from another agent that fits the remaining budget "bin".
**Benefit**
Maximizes overall throughput by opportunistically filling small budget gaps with lightweight tasks while large reasoning jobs wait for the next budget refill.
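The four steps above can be sketched as a single scheduling round; the function name, queue shapes, and quantum size are illustrative assumptions, not the pump's actual internals:

```python
def ta_wdrr_pick(agents, deficits, quantum, global_budget):
    """One round of token-aware weighted deficit round robin (sketch).

    `agents` maps agent_id -> list of (message, estimated_tokens).
    Returns (agent_id, message, tokens) for the job served, or None.
    """
    for agent_id, queue in agents.items():
        # Each agent accrues its quantum this round, whether or not it runs.
        deficits[agent_id] = deficits.get(agent_id, 0) + quantum
        if not queue:
            continue
        msg, cost = queue[0]
        # Skip jobs exceeding this agent's deficit or the shared TPM budget.
        if cost > deficits[agent_id] or cost > global_budget:
            continue
        queue.pop(0)
        deficits[agent_id] -= cost
        return agent_id, msg, cost
    return None
```

A skipped agent keeps its accrued deficit, so a large reasoning job eventually becomes eligible once enough rounds (and budget refills) have passed.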
### Adaptive Congestion Control (Servo Loop)
**Concept**
A feedback control system that reacts to 429 rate-limit errors by dynamically adjusting the pump's assumed TPM budget.
**Mechanism**
- Uses Additive Increase Multiplicative Decrease (AIMD).
- On rate-limit error → immediately halve the local TPM budget.
- On successful calls → gradually increase the budget to probe the provider's true capacity.
**Benefit**
Avoids "thundering herd" retry storms and smoothly converges to the maximum sustainable rate without manual tuning.
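A minimal AIMD controller sketch; the initial budget, floor, and additive step are placeholder values, not measured provider limits:

```python
class AimdBudget:
    """AIMD servo loop over the assumed local TPM budget (sketch)."""

    def __init__(self, initial_tpm=60_000, floor=1_000, step=1_000):
        self.tpm = initial_tpm
        self.floor = floor  # never shrink below this, so progress continues
        self.step = step    # additive probe applied per successful call

    def on_rate_limit(self):
        # Multiplicative decrease: halve the budget on a 429.
        self.tpm = max(self.floor, self.tpm // 2)

    def on_success(self):
        # Additive increase: gradually probe the provider's true capacity.
        self.tpm += self.step
```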
### Feedforward Prep-Side Token Estimation
**Concept**
During early message preparation (repair, validation, deserialization), pre-compute and tag each message with an `estimated_tokens` attribute.
**Logic**
The scheduler can use this tag to make informed decisions **before** handing the message to the LLM abstraction layer, avoiding late failures.
**Benefit**
Enables predictive skipping/reordering without waiting for the API call to fail.
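A sketch of prep-side tagging, assuming a crude four-characters-per-token heuristic; a real implementation would use the provider's tokenizer, and the field names here are illustrative:

```python
def tag_estimated_tokens(message: dict, max_tokens: int = 1024) -> dict:
    """Attach an `estimated_tokens` attribute during message preparation.

    Estimate = rough prompt tokens (~4 chars/token) + the reserved
    max_tokens for the completion, so the scheduler sees worst-case cost.
    """
    prompt_tokens = len(message.get("prompt", "")) // 4 + 1
    message["estimated_tokens"] = prompt_tokens + max_tokens
    return message
```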
### Context-Isolated Memory via Dot-Notation Thread Paths
**Concept**
Use the full hierarchical thread path (e.g., `sess-abcd1234.researcher.search.calc`) as the unique key for per-conversation memory/state.
**Logic**
Even when multiple threads call the same shared tool/agent, their histories and any "memory button" state remain strictly partitioned by the thread path.
**Benefit**
- Prevents context poisoning across parallel branches.
- Allows stateless, horizontally scalable agents while preserving private conversation continuity.
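A sketch of thread-path-keyed memory; the class and method names are invented for illustration:

```python
from collections import defaultdict


class ThreadScopedMemory:
    """Conversation state partitioned by hierarchical thread path (sketch)."""

    def __init__(self):
        self._store = defaultdict(list)  # thread_path -> private history

    def append(self, thread_path: str, entry) -> None:
        self._store[thread_path].append(entry)

    def history(self, thread_path: str) -> list:
        # Only the exact path's history; sibling branches stay isolated
        # even when they route through the same shared tool or agent.
        return list(self._store[thread_path])
```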
### Key Metrics for Monitoring ("Speedometer")
| Metric | Description |
|-------------------------|-----------------------------------------------------------------------------|
| TPM (Tokens Per Minute) | Real-time rolling window of input + output tokens across the organism. |
| RPM (Requests Per Minute) | Count of individual API calls to avoid separate request-rate throttling. |
| Reservation Variance | Difference between reserved max_tokens and actual consumption; used to "refund" budget in real time. |
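The TPM row could be backed by a rolling-window meter along these lines; the explicit `now` parameter exists only to keep the sketch deterministic, and a production meter would rely on the monotonic clock alone:

```python
import time
from collections import deque


class TpmMeter:
    """Rolling 60-second token counter for the 'speedometer' (sketch)."""

    def __init__(self, window: float = 60.0):
        self.window = window
        self._events: deque = deque()  # (timestamp, tokens)

    def record(self, tokens: int, now=None) -> None:
        ts = time.monotonic() if now is None else now
        self._events.append((ts, tokens))

    def tpm(self, now=None) -> int:
        now = time.monotonic() if now is None else now
        # Evict events that have slid out of the window, then sum the rest.
        while self._events and self._events[0][0] <= now - self.window:
            self._events.popleft()
        return sum(t for _, t in self._events)
```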
These enhancements would layer naturally on top of the existing per-thread queue model:
- Threads remain the unit of conversation memory and hierarchical tracing (unchanged).
- Agents become the unit of scheduling fairness and token budgeting.
- The dispatcher loop could select the next **agent** to service (round-robin or weighted), then drain the highest-priority thread queue belonging to that agent, applying the token-aware checks.
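One dispatcher pass under that layering might look like the sketch below; all data shapes (`thread_queues` as agent id → ordered list of thread queues, a caller-supplied token estimator) are assumptions, not the current pump's internals:

```python
def dispatch_once(agent_order, thread_queues, budget_left, est_tokens):
    """Pick the next agent round-robin, then serve its highest-priority
    thread queue, applying a token-aware admission check (sketch).

    `thread_queues` maps agent_id -> ordered list of (thread_path, messages).
    Returns (agent_id, thread_path, message) or None if nothing fits.
    """
    for _ in range(len(agent_order)):
        agent_id = agent_order.pop(0)
        agent_order.append(agent_id)  # round-robin rotation
        for thread_path, msgs in thread_queues.get(agent_id, []):
            if msgs and est_tokens(msgs[0]) <= budget_left:
                return agent_id, thread_path, msgs.pop(0)
    return None
```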
This keeps the current thread-aware, provenance-preserving design while adding production-grade rate-limit resilience and fairness for LLM-heavy workloads. Worth considering for a future v2.1 scheduling module.