Sigmoid Scaling Lets Teams Predict RL Post-Training Returns for LLMs

Predictability gap in RL post-training

Reinforcement learning (RL) post-training has become a critical tool for reasoning-centric large language models, but until now teams lacked reliable scaling rules to forecast returns. Groups have been spending tens of thousands of GPU-hours without a principled way to estimate whether more compute will keep delivering gains. A multi-institution study from Meta, UT Austin, UCL, Berkeley, Harvard, and Periodic Labs provides a compute-performance framework validated over more than 400,000 GPU-hours that models RL progress with a sigmoidal curve and supplies a tested recipe, ScaleRL, that follows those predicted curves up to 100,000 GPU-hours.

Sigmoids, not power laws

Pre-training often obeys power laws between loss and compute. RL fine-tuning, however, typically optimizes bounded metrics such as pass rate or mean reward, for which power-law fits are unstable when extrapolating from small runs. The study shows that fitting a sigmoid to pass rate versus training compute is empirically more robust and produces stable extrapolations once the very early, noisy regime (roughly the first 1.5k GPU-hours) is excluded. The sigmoid's parameters are intuitive: one sets the asymptotic performance (the ceiling), another is an efficiency exponent that controls how steeply gains accrue, and a third marks the midpoint where improvements are fastest.
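
To make the fitting procedure concrete, here is a minimal sketch in Python. It assumes a three-parameter saturating form, pass_rate(C) = A / (1 + (C_mid / C)^B); both this parameterization and the data points are illustrative assumptions for the sketch, not the paper's exact setup.

```python
# Minimal sketch: fit a saturating (sigmoidal) compute-performance curve to
# observed pass rates. Parameterization and numbers are illustrative:
#   A     = asymptotic ceiling, B = efficiency exponent (steepness of gains),
#   C_mid = compute at which performance reaches half the ceiling.
import numpy as np
from scipy.optimize import curve_fit

def sigmoid_compute_curve(compute, A, B, C_mid):
    """Pass rate as a saturating function of training compute (GPU-hours)."""
    return A / (1.0 + (C_mid / compute) ** B)

# Hypothetical measurements: GPU-hours spent and mean pass rate on an iid validation set.
gpu_hours = np.array([2e3, 3e3, 4e3, 6e3, 8e3])
pass_rate = np.array([0.17, 0.22, 0.27, 0.34, 0.39])

# Exclude the noisy startup region (here, anything under ~1.5k GPU-hours) before fitting.
mask = gpu_hours >= 1.5e3
(A_hat, B_hat, C_mid_hat), _ = curve_fit(
    sigmoid_compute_curve,
    gpu_hours[mask],
    pass_rate[mask],
    p0=[0.6, 1.0, 5e3],                         # rough initial guess
    bounds=([0.0, 0.1, 1e2], [1.0, 5.0, 1e6]),  # keep parameters in a sane range
)
print(f"ceiling A={A_hat:.2f}, efficiency B={B_hat:.2f}, midpoint C_mid={C_mid_hat:.0f} GPU-hours")
```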

Why this matters in practice: after about 1–2k GPU-hours you can fit the sigmoidal curve and forecast whether extending to 10k–100k GPU-hours is likely to be worth the budget. By contrast, power-law fits can suggest misleading ceilings unless you fit only at very high compute, which defeats the purpose of early forecasting. See the paper: https://arxiv.org/pdf/2510.13786
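
Continuing the sketch above, the fitted parameters can be extrapolated to candidate budgets to decide whether a longer run is worth it (again with invented numbers):

```python
# Extrapolate the fitted curve to candidate budgets and compare each forecast
# against the latest checkpoint before committing more compute.
budgets = np.array([1e4, 5e4, 1e5])  # candidate GPU-hour budgets
forecast = sigmoid_compute_curve(budgets, A_hat, B_hat, C_mid_hat)
latest = pass_rate[-1]
for budget, pred in zip(budgets, forecast):
    print(f"{budget:>9,.0f} GPU-hours -> predicted pass rate {pred:.2f} "
          f"(+{pred - latest:.2f} over the 8k GPU-hour checkpoint)")
```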

ScaleRL: a recipe that scales predictably

ScaleRL is not a single algorithmic novelty but a composition of engineering and loss choices that produced stable, extrapolatable scaling in the study. Key elements include an asynchronous, off-policy generation pipeline (PipelineRL-style, with a small off-policy lag); a truncated importance-sampling loss (CISPO) with prompt-level loss averaging and batch-level advantage normalization; FP32 precision at the logits to keep the generator and trainer numerically consistent; and data curation such as filtering zero-variance prompts and not resampling prompts the model already solves (no-positive-resampling).
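
As a rough illustration only, these knobs might be gathered into a single configuration object like the one below; the field names and defaults are assumptions for this sketch, not the paper's actual schema.

```python
# Hypothetical configuration sketch for a ScaleRL-style recipe. Field names and
# defaults are assumptions for illustration, not the paper's implementation.
from dataclasses import dataclass

@dataclass
class ScaleRLConfig:
    async_off_policy_lag: int = 8             # PipelineRL-style asynchronous generation
    loss_type: str = "cispo"                  # truncated importance-sampling loss
    loss_aggregation: str = "prompt_mean"     # prompt-level loss averaging
    advantage_norm: str = "batch"             # batch-level advantage normalization
    logits_dtype: str = "fp32"                # FP32 precision at the logits
    drop_zero_variance_prompts: bool = True   # skip prompts with no learning signal
    no_positive_resampling: bool = True       # do not resample already-solved prompts
```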

The team validated each component with leave-one-out (LOO) ablations at 16k GPU-hours and showed that ScaleRL's fitted curves extrapolate reliably from 8k to 16k GPU-hours and continue to hold at much larger scales, including a single run extended to 100k GPU-hours. See the experiments and methods in the paper: https://arxiv.org/pdf/2510.13786
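
A simple way to run that kind of check on your own training curves is to fit on the low-compute prefix of a run and compare the prediction against what was actually observed at a larger budget. The sketch below reuses sigmoid_compute_curve from the earlier block; all numbers are invented.

```python
# Extrapolation check: fit on the low-compute prefix of a run, then compare the
# prediction against the observed pass rate at a larger budget (toy numbers).
fit_hours  = np.array([2e3, 4e3, 6e3, 8e3])
fit_pass   = np.array([0.17, 0.27, 0.34, 0.39])
held_hours = np.array([1.2e4, 1.6e4])
held_pass  = np.array([0.44, 0.47])

popt, _ = curve_fit(sigmoid_compute_curve, fit_hours, fit_pass,
                    p0=[0.6, 1.0, 5e3], bounds=([0.0, 0.1, 1e2], [1.0, 5.0, 1e6]))
prediction = sigmoid_compute_curve(held_hours, *popt)
print("mean absolute extrapolation error:", np.mean(np.abs(prediction - held_pass)))
```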

Validation across models and tasks

Two key demonstrations support generalization. First, for an 8B dense model and a Llama-4 17B×16 MoE model called ‘Scout’, extended training closely followed sigmoid extrapolations derived from smaller-compute segments. Second, pass-rate improvements on an iid validation set tracked downstream evaluation (for example, AIME-24), suggesting that the compute-performance curve is not an artifact of the validation dataset.

The study also compares ScaleRL against other prevalent recipes (for example, DeepSeek (GRPO), Qwen-2.5 (DAPO), Magistral, MiniMax-M1) and reports higher asymptotic performance and better compute efficiency for ScaleRL in their setups.

Which knobs move the ceiling and which shape efficiency

The framework lets teams classify design choices by their primary effect: interventions such as model scale, longer generation lengths, and larger batch sizes chiefly raise the asymptotic ceiling, while choices such as loss aggregation, advantage normalization, data curriculum, and the off-policy setup mainly change compute efficiency, that is, how quickly a run approaches its ceiling.

Operational advice from the paper: fit sigmoids early, just past the noisy startup region; prioritize interventions that raise the ceiling; then tune efficiency knobs so the run approaches that ceiling with less compute.
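
One way to operationalize that classification, again reusing the earlier fit with invented numbers, is to fit the same curve to a run with and without a given intervention and check whether the ceiling A or the efficiency B moved:

```python
# Classify an intervention by fitting the same curve with and without it and
# comparing the fitted ceiling (A) and efficiency (B). Runs and numbers are invented.
runs = {
    "with_intervention":    (np.array([2e3, 4e3, 8e3, 1.6e4]), np.array([0.17, 0.27, 0.39, 0.47])),
    "without_intervention": (np.array([2e3, 4e3, 8e3, 1.6e4]), np.array([0.15, 0.24, 0.34, 0.42])),
}
for name, (hours, rates) in runs.items():
    (A, B, C_mid), _ = curve_fit(sigmoid_compute_curve, hours, rates,
                                 p0=[0.6, 1.0, 5e3], bounds=([0.0, 0.1, 1e2], [1.0, 5.0, 1e6]))
    print(f"{name:>21}: ceiling A={A:.2f}, efficiency B={B:.2f}, midpoint C_mid={C_mid:.0f}")
```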

Key operational takeaways