Selective High-Entropy Token Training Boosts LLM Reasoning and Cuts Costs
Selective training on high-entropy tokens in LLMs improves reasoning performance and reduces computational costs, setting new benchmarks on AIME tests.
Understanding Token Entropy in LLM Reasoning
Large Language Models (LLMs) generate complex step-by-step outputs known as chains of thought (CoTs). Each token contributes to the unfolding argument, but not all tokens influence the reasoning process equally. Token entropy, the uncertainty of the model’s next-token distribution at each position, reveals which tokens represent critical decision points in reasoning.
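At each generated position the model produces a probability distribution over its vocabulary, and the Shannon entropy of that distribution is the token’s entropy. Below is a minimal PyTorch sketch of this computation; the helper name and the toy logits are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Shannon entropy of the model's next-token distribution at each position.

    logits: (seq_len, vocab_size) raw model outputs for one sequence.
    Returns a (seq_len,) tensor; higher values mean the model is less certain
    which token comes next.
    """
    log_probs = F.log_softmax(logits, dim=-1)   # log p_i, numerically stable
    probs = log_probs.exp()                     # p_i
    return -(probs * log_probs).sum(dim=-1)     # H = -sum_i p_i * log p_i

# Toy example: one near-deterministic position, one maximally uncertain one.
logits = torch.tensor([[10.0, 0.0, 0.0],   # one token dominates  -> H ≈ 0.00
                       [1.0, 1.0, 1.0],    # uniform over 3 tokens -> H = ln 3 ≈ 1.10
                       [3.0, 2.0, 0.0]])   # mixed                 -> H ≈ 0.71
print(token_entropy(logits))
```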
Limitations of Uniform Token Training in Reinforcement Learning
Traditional reinforcement learning with verifiable rewards (RLVR) trains models by treating all tokens uniformly during policy updates. Methods like Proximal Policy Optimization (PPO), Group Relative Policy Optimization (GRPO), and Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO) evaluate entire token sequences without distinguishing the importance of individual tokens. This approach often wastes resources on tokens that merely extend existing thoughts rather than shift reasoning paths.
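These methods differ in how they estimate advantages, but they share a clipped surrogate objective that averages the per-token loss uniformly over the whole response. The sketch below illustrates that uniform treatment under simplified assumptions; the function signature and tensor shapes are illustrative, not code from PPO, GRPO, or DAPO.

```python
import torch

def uniform_policy_loss(logp_new: torch.Tensor,
                        logp_old: torch.Tensor,
                        advantages: torch.Tensor,
                        mask: torch.Tensor,
                        clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate loss in which every response token counts equally.

    logp_new, logp_old: (batch, seq_len) per-token log-probs under the current
        and behaviour policies.
    advantages: (batch, seq_len) advantage estimates (e.g. a single
        group-normalized reward broadcast over the sequence, as in GRPO).
    mask: (batch, seq_len) 1 for response tokens, 0 for prompt/padding.
    """
    ratio = (logp_new - logp_old).exp()
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    per_token_loss = -torch.minimum(unclipped, clipped)
    # Uniform averaging: low-entropy "filler" tokens weigh exactly as much as
    # the tokens that actually decide the reasoning path.
    return (per_token_loss * mask).sum() / mask.sum().clamp(min=1)
```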
High-Entropy Tokens as Forking Points
Researchers from Alibaba and Tsinghua University analyzed Qwen3 models and discovered that roughly 20% of tokens have high entropy, termed “forking tokens.” These tokens correspond to decision points where the model must choose between different reasoning paths. The other 80% of tokens show low entropy and primarily serve as extensions of prior logic.
Selective Training on High-Entropy Tokens
By limiting policy gradient updates to only the high-entropy tokens, the researchers maintained or improved performance on challenging reasoning benchmarks. They quantified token entropy from the model’s next-token probability distribution at each position, finding that more than half of all tokens had entropy below 0.01, indicating near-deterministic prediction, while the top 20% had entropy above 0.672, marking them as key decision points.
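A plausible way to implement this selection, assuming the `token_entropy` and `uniform_policy_loss` sketches above, is to compute a per-batch entropy threshold at the (1 − ρ) quantile and zero out every token below it before averaging the loss. The per-batch quantile and the helper names are assumptions for illustration; the authors’ exact implementation may differ.

```python
import torch

def forking_token_mask(entropy: torch.Tensor,
                       response_mask: torch.Tensor,
                       top_ratio: float = 0.2) -> torch.Tensor:
    """Keep only the highest-entropy response tokens for the policy update.

    entropy: (batch, seq_len) per-token entropies (see token_entropy above).
    response_mask: (batch, seq_len) 1 for response tokens, 0 elsewhere.
    top_ratio: fraction of tokens to keep; 0.2 mirrors the paper's 20% setting.
    """
    flat = entropy[response_mask.bool()]
    # Entropy value at the (1 - top_ratio) quantile of this batch,
    # e.g. the 80th percentile when top_ratio = 0.2.
    threshold = torch.quantile(flat, 1.0 - top_ratio)
    return (entropy >= threshold).float() * response_mask

# Usage sketch: restrict the clipped objective above to forking tokens only.
# selective_mask = forking_token_mask(entropy, response_mask, top_ratio=0.2)
# loss = uniform_policy_loss(logp_new, logp_old, advantages, selective_mask)
```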
Experimental Results
Experiments with Qwen3-8B, Qwen3-14B, and Qwen3-32B models showed that training on only the top 20% highest-entropy tokens yielded significant performance gains. The Qwen3-32B model scored 63.5 on AIME’24 and 56.7 on AIME’25, outperforming larger, traditionally trained models. Extending the response length from 20k to 29k tokens further improved scores. Conversely, restricting training to the remaining 80% of low-entropy tokens degraded performance.
Optimal Threshold and Scalability
An ablation study confirmed that the 20% threshold balances exploration and performance effectively. Reducing it to 10% omitted important decision points, while increasing it diluted the benefits by including low-entropy tokens. Larger models benefited more from this selective training due to their capacity for enhanced exploration. This strategy scales well and offers a practical approach to improving reasoning in LLMs while reducing training costs.
Key Takeaways
- About 20% of tokens act as critical decision points with high entropy.
- Training focused on these tokens matches or exceeds full-token training performance.
- Qwen3-32B set new benchmarks on AIME’24 and AIME’25 using this method.
- Extending response length further boosts performance.
- Training on low-entropy tokens degrades results.
- The selective approach reduces computational overhead and enhances reasoning.
This research presents a new paradigm in reinforcement learning for LLMs by aligning training efforts with the tokens that matter most for reasoning, leading to improved accuracy and efficiency.