Sigmoid Scaling Lets Teams Predict RL Post-Training Returns for LLMs
Predictability gap in RL post-training
Reinforcement learning (RL) post-training has become a critical tool for reasoning-centric large language models, but until now teams lacked reliable scaling rules to forecast returns. Groups have been spending tens of thousands of GPU-hours without a principled way to estimate whether more compute will keep delivering gains. A multi-institution study from Meta, UT Austin, UCL, Berkeley, Harvard, and Periodic Labs provides a compute-performance framework validated over more than 400,000 GPU-hours that models RL progress with a sigmoidal curve and supplies a tested recipe, ScaleRL, that follows those predicted curves up to 100,000 GPU-hours.
Sigmoids, not power laws
Pre-training loss typically follows a power law in compute. RL fine-tuning, however, optimizes bounded metrics such as pass rate or mean reward, and power-law fits to bounded metrics extrapolate poorly from small runs. The study shows that fitting a sigmoid to pass rate versus training compute is empirically more robust and yields stable extrapolations once the very early, noisy regime (roughly the first 1.5k GPU-hours) is excluded. The sigmoid's parameters are intuitive: one sets the asymptotic performance (the ceiling), another is a compute-efficiency exponent (how steeply gains accrue), and a third marks the midpoint where improvement is fastest.
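Concretely, one saturating parameterization consistent with that description (the exact functional form and symbol names here are an illustrative assumption, not quoted from the paper) is

\[
R(C) = R_0 + \frac{A - R_0}{1 + \left(C_{\mathrm{mid}} / C\right)^{B}}
\]

where C is RL training compute, R_0 the pass rate before RL, A the ceiling, B the efficiency exponent, and C_mid the compute at which gains are fastest.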
Why this matters in practice: after about 1–2k GPU-hours you can fit the sigmoidal curve and forecast whether extending a run to 10k–100k GPU-hours is likely to be worth the budget. Power-law fits, by contrast, can suggest misleading ceilings unless they are fitted only at very high compute, which defeats the purpose of early forecasting. See the paper: https://arxiv.org/pdf/2510.13786
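A minimal fitting sketch (not the paper's code) shows the workflow: fit the saturating form above to early pass-rate measurements, then extrapolate to a larger budget. The data points, initial guesses, and the fixed baseline R0 are made up for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

R0 = 0.25  # pre-RL baseline pass rate (assumed known from a pre-run evaluation)


def sigmoid_compute(c, a, b, c_mid):
    """Predicted pass rate at training compute c (GPU-hours)."""
    return R0 + (a - R0) / (1.0 + (c_mid / c) ** b)


# Measurements from an early run segment; the first ~1.5k GPU-hours are
# excluded as the noisy startup regime (all values are illustrative).
compute = np.array([2000.0, 3000.0, 4500.0, 6000.0, 8000.0])
pass_rate = np.array([0.32, 0.38, 0.44, 0.48, 0.52])

(a, b, c_mid), _ = curve_fit(
    sigmoid_compute, compute, pass_rate,
    p0=[0.7, 1.0, 8000.0],                   # rough initial guesses
    bounds=([0.0, 0.1, 1.0], [1.0, 5.0, 1e6]),  # keep rates in [0, 1]
)

print(f"fitted ceiling A ~ {a:.2f}, exponent B ~ {b:.2f}, midpoint ~ {c_mid:.0f} GPU-h")
print(f"forecast pass rate at 100k GPU-hours: {sigmoid_compute(1e5, a, b, c_mid):.2f}")
```

If the forecast at the target budget barely exceeds the current pass rate, the extrapolation argues for spending the budget elsewhere (or on a ceiling-raising change) rather than simply training longer.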
ScaleRL: a recipe that scales predictably
ScaleRL is not a single algorithmic novelty but a composition of engineering and loss choices that produced stable, extrapolatable scaling in the study. Key elements include the following (the loss-side choices are sketched in code after the list):
- Asynchronous Pipeline RL (generator–trainer split across GPUs) for high off-policy throughput
- CISPO (truncated importance-sampling REINFORCE) as the RL loss
- FP32 precision at the logits to avoid numeric mismatch between generator and trainer
- Prompt-level loss averaging and batch-level advantage normalization
- Forced length interruptions to cap runaway traces
- Zero-variance filtering (drop prompts that provide no gradient signal)
- No-Positive-Resampling (drop prompts whose pass rate reaches >= 0.9 from later epochs)
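The snippet below is a hedged PyTorch sketch of the loss-side pieces: a truncated importance-sampling REINFORCE objective in the spirit of CISPO, batch-level advantage normalization, prompt-level loss averaging, and zero-variance filtering. Tensor shapes, the clipping threshold, and all hyperparameters are assumptions for illustration, not the paper's exact implementation.

```python
import torch


def scalerl_style_loss(logprobs, old_logprobs, rewards, prompt_ids,
                       token_mask, clip_max=2.0):
    """
    logprobs:      [B, T] token log-probs under the current (trainer) policy
    old_logprobs:  [B, T] token log-probs under the generator policy
    rewards:       [B]    scalar reward per sampled trace
    prompt_ids:    [B]    which prompt each trace belongs to
    token_mask:    [B, T] 1 for generated tokens, 0 for padding
    """
    # Zero-variance filtering: drop prompts whose traces all share the same
    # reward (all-correct or all-wrong), since they carry no gradient signal.
    keep = torch.ones_like(rewards, dtype=torch.bool)
    for pid in prompt_ids.unique():
        group = prompt_ids == pid
        if rewards[group].std(unbiased=False) == 0:
            keep &= ~group
    if not keep.any():
        return logprobs.new_zeros(())  # nothing informative in this batch

    # Batch-level advantage normalization over the surviving traces.
    adv = rewards[keep]
    adv = (adv - adv.mean()) / (adv.std(unbiased=False) + 1e-8)

    # Truncated importance-sampling REINFORCE: cap the detached token-level
    # IS ratio between trainer and generator policies, then weight log-probs.
    ratio = (logprobs[keep] - old_logprobs[keep]).exp().detach()
    ratio = ratio.clamp(max=clip_max)
    token_loss = -ratio * adv[:, None] * logprobs[keep] * token_mask[keep]

    # Prompt-level averaging: mean over tokens within a trace, then over the
    # traces of each prompt, then over prompts, so long traces and heavily
    # sampled prompts do not dominate the update.
    trace_loss = token_loss.sum(dim=1) / token_mask[keep].sum(dim=1).clamp(min=1)
    kept_prompts = prompt_ids[keep]
    per_prompt = torch.stack(
        [trace_loss[kept_prompts == pid].mean() for pid in kept_prompts.unique()]
    )
    return per_prompt.mean()
```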
The team validated each component with leave-one-out (LOO) ablations at 16k GPU-hours and showed that ScaleRL's fitted curves extrapolate reliably from 8k to 16k GPU-hours and continue to hold at much larger scales, including a single run extended to 100k GPU-hours. See the experiments and methods in the paper: https://arxiv.org/pdf/2510.13786
Validation across models and tasks
Two key demonstrations support generalization. First, for an 8B dense model and a Llama-4 17B×16 MoE model called ‘Scout’, extended training closely followed sigmoid extrapolations derived from smaller-compute segments. Second, pass-rate improvements on an i.i.d. validation set tracked downstream evaluation (for example, AIME-24), suggesting that the compute-performance curve is not an artifact of the validation dataset.
The study also compares ScaleRL against other prevalent recipes (for example, DeepSeek (GRPO), Qwen-2.5 (DAPO), Magistral, MiniMax-M1) and reports higher asymptotic performance and better compute efficiency for ScaleRL in their setups.
Which knobs move the ceiling and which shape efficiency
The framework lets teams classify design choices by their primary effect:
- Ceiling movers (raise the asymptote): increasing model scale (for example, MoE), using much longer generation lengths (up to 32,768 tokens), and larger global batch sizes can all lift the final asymptotic performance, although they may slow early gains.
- Efficiency shapers (govern how quickly you reach the ceiling): loss aggregation, advantage normalization, data curriculum, and the off-policy pipeline mainly change how fast the curve rises toward the asymptote, not the asymptote itself.
Operational advice from the paper: fit sigmoids early after the noisy startup region, prioritize interventions that raise the ceiling, then tune efficiency knobs to reach that ceiling faster at fixed compute.
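To make the distinction concrete, here is a toy illustration using the assumed sigmoid form from earlier (all parameter values are made up): a ceiling mover raises the fitted asymptote A, while an efficiency shaper changes B or C_mid and only affects how soon the run approaches the same ceiling.

```python
# Toy comparison (illustrative numbers only) of ceiling movers vs. efficiency shapers.
def pass_rate(c, a, b, c_mid, r0=0.25):
    return r0 + (a - r0) / (1.0 + (c_mid / c) ** b)

base = dict(a=0.60, b=1.0, c_mid=8_000)
ceiling_move = dict(a=0.70, b=1.0, c_mid=8_000)  # e.g. bigger model, longer generations
efficiency = dict(a=0.60, b=1.5, c_mid=5_000)    # e.g. better aggregation, faster pipeline

for name, p in [("base", base), ("ceiling mover", ceiling_move), ("efficiency shaper", efficiency)]:
    print(f"{name:17s} 16k GPU-h: {pass_rate(16_000, **p):.2f}   "
          f"100k GPU-h: {pass_rate(100_000, **p):.2f}")
```

In this toy setting, the efficiency shaper looks better at 16k GPU-hours but converges to the same 0.60 ceiling as the base run, whereas the ceiling mover ends meaningfully higher at 100k GPU-hours.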
Key operational takeaways
- Model RL post-training progress with sigmoidal compute-performance curves (pass rate vs log compute) for reliable extrapolation.
- Use ScaleRL’s composition of pipeline, loss, precision, and data handling choices to achieve stable scaling behavior.
- Fit curves early (after ~1–2k GPU-hours) to forecast returns and make budget decisions before burning compute. The team validated predictions up to 100k GPU-hours for an 8B dense model and ~50k GPU-hours for a 17B×16 MoE ‘Scout’.