Seer: Online Context Learning That Shrinks RL Rollout Tails and Boosts Throughput
Seer restructures synchronous RL rollouts to reduce tail latency and improve throughput by up to 97% using divided rollout, context-aware scheduling and grouped speculative decoding.
The rollout bottleneck in synchronous RL
Reinforcement learning for large reasoning models often stalls because synchronous on-policy setups wait for every rollout response before proceeding. Modern chain-of-thought style workloads generate very long outputs, causing a few straggling requests to dominate iteration time. Moonshot AI and Tsinghua researchers propose Seer to address this systems bottleneck without changing the underlying RL algorithm.
Why synchronous rollouts are slow for reasoning models
Large reasoning models in the Seer experiments include Moonlight, Qwen2-VL 72B and Kimi K2. The experiments run on a cluster of 32 compute nodes with 8 H800 GPUs per node, using between 32 and 256 GPUs across experiments and hundreds of prompts per iteration. Maximum generation lengths are massive: Moonlight up to 65,536 tokens, Qwen2-VL 72B up to 40,960 tokens and Kimi K2 up to 98,304 tokens. As decoding progresses, a single long chain-of-thought request can grow KVCache usage from a few hundred megabytes to tens of gigabytes.
This memory growth forces instances to reduce concurrency or preempt requests, which triggers expensive redecoding. The research team defines tail requests as the last 10 percent of requests to finish in a rollout. In baseline synchronous systems the tail can consume a disproportionate share of iteration time and directly slow RL training.
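For a sense of scale, the back-of-the-envelope calculation below estimates how KVCache grows with sequence length for one long request; the layer count, KV head count and head dimension are illustrative assumptions, not the published configurations of the models above.

```python
# Rough KVCache footprint for one long chain-of-thought request.
# Model dimensions below are illustrative assumptions, not the exact
# configurations of Moonlight, Qwen2-VL 72B or Kimi K2.

def kvcache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                  seq_len: int, bytes_per_elem: int = 2) -> int:
    # Key and value tensors per layer: 2 * seq_len * num_kv_heads * head_dim
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical 72B-class model with grouped-query attention, fp16 cache.
layers, kv_heads, head_dim = 80, 8, 128

for tokens in (1_000, 8_000, 65_536):
    gb = kvcache_bytes(layers, kv_heads, head_dim, tokens) / 1e9
    print(f"{tokens:>7} tokens -> ~{gb:.1f} GB of KVCache")
```

With these assumed dimensions a request grows from roughly 0.3 GB at 1,000 tokens to more than 20 GB at 65,536 tokens, which matches the qualitative picture of a few hundred megabytes swelling to tens of gigabytes.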
Seer system architecture
Seer preserves the synchronous GRPO algorithm and its on-policy guarantees. Training still uses only data from the current rollout iteration, and distributed optimization is performed with Megatron. Inference during rollout uses an in-house vLLM implementation. The key infrastructure enabler is a Global KVCache Pool based on Mooncake, a disaggregated KVCache architecture with a two-tier DRAM and SSD store shared across inference nodes. This allows Seer to migrate requests between instances without recomputing prefills.
On top of this substrate, Seer introduces three coordinated mechanisms, orchestrated by a Request Buffer, a Context Manager and an Inference Engine Pool connected to the Global KVCache Pool (a rough data-model sketch follows the list):
- Divided Rollout
- Context Aware Scheduling
- Adaptive Grouped Speculative Decoding
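The sketch below gives one possible data model for these orchestration components; the class names echo the text, but the fields and methods are assumptions rather than Seer's actual interfaces.

```python
from dataclasses import dataclass, field

@dataclass
class RolloutRequest:
    request_id: str
    group_id: str              # requests that share a prompt form a GRPO group
    max_tokens: int            # original generation limit for this request
    generated_tokens: int = 0  # updated after every dispatched chunk

@dataclass
class ContextManager:
    # Per-group output-length estimates learned online during the rollout.
    group_len_estimate: dict = field(default_factory=dict)

    def update(self, group_id: str, finished_len: int) -> None:
        prev = self.group_len_estimate.get(group_id, 0)
        self.group_len_estimate[group_id] = max(prev, finished_len)

@dataclass
class RequestBuffer:
    # Fine-grained request chunks waiting to be dispatched to the engine pool.
    pending: list = field(default_factory=list)
```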
Divided rollout and fine grained scheduling
Traditional synchronous rollout assigns whole groups of requests that share a prompt to a single instance. Because requests in a group can vary widely in output length, this approach creates load imbalance and long-running stragglers. Seer decomposes each group into individual requests, and further splits each request into multiple chunks based on generation length. The scheduler dispatches a chunk with a small max tokens budget, for example 8,000 tokens. After each chunk the request is re-enqueued until it finishes or reaches its original limit.
Because KVCache is stored in the Global KVCache Pool, divided request chunks can migrate between instances at chunk boundaries without rerunning the prefill. The scheduler controls concurrency to keep memory utilization high while avoiding preemption. This chunking and migration smooths KVCache usage across the iteration and reduces wasted GPU time.
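A minimal sketch of this divided-rollout loop follows; the engine-pool interface, result fields and the 8,000-token chunk budget are placeholders standing in for Seer's internal API, not its real implementation.

```python
from collections import deque

CHUNK_TOKENS = 8_000  # per-chunk max tokens budget (example value from the text)

def divided_rollout(requests, engine_pool):
    """Run each request as a sequence of small chunks, re-enqueueing it after
    every chunk until it emits EOS or reaches its original max tokens limit."""
    # Each entry: (request_id, tokens_generated_so_far, original_max_tokens)
    queue = deque((r["id"], 0, r["max_tokens"]) for r in requests)
    outputs = {}
    while queue:
        request_id, generated, limit = queue.popleft()
        # Any instance can continue the request: its KVCache lives in the
        # shared Global KVCache Pool, so no prefill is recomputed when the
        # next chunk lands on a different instance.
        instance = engine_pool.pick_least_loaded()
        budget = min(CHUNK_TOKENS, limit - generated)
        result = instance.generate(request_id, max_tokens=budget)
        generated += result["num_tokens"]
        if result["finished"] or generated >= limit:
            outputs[request_id] = result["text"]
        else:
            queue.append((request_id, generated, limit))  # re-enqueue remainder
    return outputs
```

The scheduler caps how many chunks run concurrently on each instance, which is what keeps memory utilization high without triggering preemption.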
Context aware scheduling using group length statistics
The team observed that requests in the same group tend to have correlated output lengths. Seer exploits this as online context. For each prompt group Seer designates one request as speculative. Speculative requests are queued at high priority and served with a smallest-first policy based on tokens generated so far. Short requests finish quickly and exit, while long requests remain and reveal potential tail groups.
The Context Manager maintains a length estimate per group, updating it to the maximum generated length among completed requests. If no request has finished the manager falls back to the original max tokens as a conservative bound. After speculative requests are in flight or done, Seer schedules remaining requests with an approximate longest-first policy at the group level. This approach produces throughput and tail behavior close to an oracle scheduler that knows output lengths in advance.
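The sketch below condenses this two-phase policy; the tuple layout, dictionary of group statistics and function name are illustrative rather than Seer's real scheduler interfaces.

```python
def schedule_order(pending, group_len_estimate, default_max_tokens):
    """Order pending (group_id, request_id, tokens_so_far) entries for dispatch."""
    speculative, remaining = [], []
    seen_groups = set()
    for group_id, request_id, tokens_so_far in pending:
        if group_id not in seen_groups:
            # Phase 1: one speculative probe per group, smallest-first, so short
            # groups drain quickly and long groups reveal themselves as tails.
            seen_groups.add(group_id)
            speculative.append((tokens_so_far, group_id, request_id))
        else:
            remaining.append((group_id, request_id))

    speculative.sort(key=lambda entry: entry[0])

    # Phase 2: remaining requests run approximately longest-first at the group
    # level, falling back to the original max tokens when no sibling finished.
    remaining.sort(
        key=lambda entry: group_len_estimate.get(entry[0], default_max_tokens),
        reverse=True,
    )
    return [(g, r) for _, g, r in speculative] + remaining

# Example: group g1 already revealed a ~30k-token sibling, so its remaining
# requests are scheduled ahead of group g2's.
order = schedule_order(
    pending=[("g1", "g1_r0", 120), ("g1", "g1_r1", 0), ("g2", "g2_r0", 40)],
    group_len_estimate={"g1": 30_000},
    default_max_tokens=65_536,
)
```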
Adaptive grouped speculative decoding
To accelerate decoding for long-tail requests Seer adds Adaptive Grouped Speculative Decoding and a Distributed Grouped Draft Server (DGDS). DGDS keeps a Compressed Suffix Tree per group and aggregates token sequences from all requests in that group. Instances append generated tokens asynchronously, fetch updated suffix trees periodically and run local speculative decoding based on shared pattern statistics.
Seer adapts draft length and the number of draft paths based on model architecture, batch size and measured acceptance length. For dense and mixture-of-experts models the system precomputes speculation thresholds to bound draft depth per batch. In late tail stages, when concurrency is low, Seer increases draft depth and enables multipath drafting to raise accepted tokens per step.
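A simplified sketch of grouped drafting and adaptive draft depth follows; a flat n-gram index stands in for the Compressed Suffix Tree, and the thresholds in `adaptive_draft_len` are invented for illustration, not taken from the paper.

```python
from collections import defaultdict

class GroupDraftStore:
    """Simplified stand-in for DGDS: aggregates token sequences from all
    requests in a group and proposes draft continuations from shared patterns.
    A flat n-gram index replaces Seer's Compressed Suffix Tree."""

    def __init__(self, ngram: int = 4):
        self.ngram = ngram
        self.continuations = defaultdict(list)  # context n-gram -> next tokens

    def append(self, tokens):
        # In the real system, instances append generated tokens asynchronously.
        for i in range(len(tokens) - self.ngram):
            key = tuple(tokens[i:i + self.ngram])
            self.continuations[key].append(tokens[i + self.ngram])

    def propose(self, context, draft_len: int):
        """Greedily extend the current context with the most frequent
        continuation observed in the group, up to draft_len tokens."""
        draft, ctx = [], list(context)
        for _ in range(draft_len):
            candidates = self.continuations.get(tuple(ctx[-self.ngram:]))
            if not candidates:
                break
            nxt = max(set(candidates), key=candidates.count)
            draft.append(nxt)
            ctx.append(nxt)
        return draft

def adaptive_draft_len(batch_size: int, avg_accepted: float,
                       base: int = 4, max_len: int = 16) -> int:
    # Late in the tail, concurrency is low and acceptance is high, so draft
    # depth can grow; these thresholds are illustrative assumptions.
    if batch_size <= 4 and avg_accepted > 2.0:
        return max_len
    return base
```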
Measured impact on RL training
Ablation studies show that divided rollout alone yields up to a 35 percent throughput improvement over the vLLM synchronous baseline. Adding Context-Aware Scheduling raises the improvement to as much as 47 percent, and enabling grouped speculative decoding brings the total speedup to 77 to 87 percent in the evaluated iterations.
Across end-to-end RL tasks built on Moonlight, Qwen2-VL 72B and Kimi K2, Seer improves rollout throughput by 74 to 97 percent relative to a strong synchronous veRL baseline using vLLM, and reduces tail latency by 75 to 93 percent. For memory-constrained tasks where the baseline spent up to half of rollout time on the last 10 percent of requests, Seer removes most of the tail through chunking, migration and adaptive speculation on top of the Global KVCache Pool.
Key takeaways
- The rollout phase often dominates synchronous RL iteration time and is driven by long-tail requests and KVCache fragmentation.
- Seer preserves on-policy guarantees while restructuring rollout via divided rollout, context-aware scheduling and adaptive grouped speculative decoding.
- A global disaggregated KVCache pool lets Seer migrate chunks without redoing expensive prefills and maintain high GPU utilization.
- Online context at the group level enables approximate oracle scheduling that sharply reduces tail latency and raises throughput.
For full technical details see the paper on arXiv: https://arxiv.org/pdf/2511.14617