QeRL Unlocks 32B RL Training on One H100 with NVFP4, Faster Rollouts and Better Exploration

QeRL is an open framework that brings weight-only 4-bit NVFP4 quantization into the reinforcement learning (RL) post-training loop while preserving stable updates with LoRA. The approach targets the rollout stage, where token generation consumes most of the wall-clock time, and uses hardware-efficient FP4×BF16 kernels to accelerate sampling without keeping a separate full-precision policy.

How NVFP4 and LoRA are combined

QeRL quantizes model weights to NVFP4 (FP4) using dual-level scaling while keeping logits and gradient math in higher precision. The rollout and prefill paths use Marlin-based FP4 kernels, so sampling runs with BF16-level accuracy at a much lower memory footprint. Backpropagation remains stable because the trainable changes are confined to LoRA modules computed in higher precision.
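A minimal sketch of the idea, assuming a simplified FP4 (E2M1) value grid, 16-element blocks with per-block scales plus one per-tensor scale, and a plain PyTorch dequantize-then-matmul forward. The real implementation relies on fused Marlin FP4×BF16 kernels; the class and function names below are hypothetical.

```python
import torch

# Non-negative magnitudes of the FP4 E2M1 grid (sign handled separately).
FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_nvfp4(w: torch.Tensor, block: int = 16):
    """Weight-only FP4 quantization with dual-level scaling: a per-block scale
    plus a per-tensor scale. A sketch, not the exact NVFP4 encoding or the
    packed layout the Marlin kernels expect. Assumes w.numel() % block == 0."""
    grid = FP4_GRID.to(w.device, w.dtype)
    tensor_scale = w.abs().max() / (6.0 * 448.0)   # keep block scales in an FP8-like range
    flat = (w / tensor_scale).reshape(-1, block)
    block_scale = flat.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 6.0
    normed = flat / block_scale
    # Snap each magnitude to the nearest grid point, keep the sign.
    idx = (normed.abs().unsqueeze(-1) - grid).abs().argmin(dim=-1)
    q = grid[idx] * normed.sign()
    return q, block_scale, tensor_scale

def dequantize_nvfp4(q, block_scale, tensor_scale, shape):
    return (q * block_scale).reshape(shape) * tensor_scale

class QuantLinearWithLoRA(torch.nn.Module):
    """Frozen FP4 base weight plus LoRA adapters; only lora_A / lora_B train."""
    def __init__(self, weight: torch.Tensor, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        q, block_scale, tensor_scale = quantize_nvfp4(weight)
        self.register_buffer("q", q)
        self.register_buffer("block_scale", block_scale)
        self.register_buffer("tensor_scale", tensor_scale)
        self.shape = weight.shape
        out_features, in_features = weight.shape
        self.lora_A = torch.nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = torch.nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        # In QeRL this matmul is served by fused FP4xBF16 kernels; here we
        # dequantize explicitly for readability.
        w = dequantize_nvfp4(self.q, self.block_scale, self.tensor_scale, self.shape)
        return x @ w.t() + self.scaling * ((x @ self.lora_A.t()) @ self.lora_B.t())
```

Only lora_A and lora_B carry gradients, so optimizer state stays small while the frozen FP4 buffers account for the (much reduced) weight memory.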

Adaptive Quantization Noise as a controlled exploration signal

A key empirical observation is that deterministic FP4 quantization increases policy entropy early in training, flattening token distributions and promoting exploration compared with 16-bit LoRA or NF4-based QLoRA. QeRL makes this exploration effect schedulable by introducing Adaptive Quantization Noise (AQN): channel-wise Gaussian perturbations mapped into LayerNorm scale parameters with an exponential annealing schedule. AQN preserves kernel fusion and requires no extra weight tensors while transitioning the policy from exploration to exploitation.
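A rough sketch of how such a schedule could look, assuming a multiplicative channel-wise perturbation folded into the LayerNorm/RMSNorm scale vector and an exponential decay between assumed start and end noise levels; the paper's exact constants and mapping are not reproduced here.

```python
import torch

def aqn_sigma(step: int, total_steps: int,
              sigma_start: float = 5e-2, sigma_end: float = 5e-4) -> float:
    """Exponentially anneal the noise scale from sigma_start to sigma_end;
    the endpoint values here are assumptions, not the paper's constants."""
    return sigma_start * (sigma_end / sigma_start) ** (step / max(total_steps, 1))

def apply_aqn(norm_scale: torch.Tensor, sigma: float) -> torch.Tensor:
    """Fold channel-wise Gaussian noise into the norm's scale vector. Because
    the perturbation lives inside an existing parameter, the quantized matmul
    kernels stay fused and no extra weight tensors are introduced."""
    return norm_scale * (1.0 + sigma * torch.randn_like(norm_scale))

# Per training step, resample from a kept clean copy of the scale so noise
# does not compound: noisy = apply_aqn(clean_scale, aqn_sigma(step, total_steps))
```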

Integration and performance advantages

The implementation integrates Marlin FP4 kernels in rollout and prefill and restricts trainable capacity with LoRA, directly reducing the cost and latency of the stage that dominates RLHF-style pipelines for long reasoning traces. Reported performance highlights include greater than 1.5× speedups in rollout, about 1.8× end-to-end speedup versus QLoRA in a representative setting, and more than 2× rollout throughput on 14B/32B models versus QLoRA in some benchmarks.
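The resulting training step might be shaped like the sketch below, where generate_fn, logprob_fn, and reward_fn are hypothetical placeholders for the rollout, scoring, and reward components, and the advantage uses a simple GRPO-style group-relative baseline.

```python
import torch

def lora_parameters(model: torch.nn.Module):
    """Freeze everything except LoRA adapters; the FP4 base weights never train."""
    for name, param in model.named_parameters():
        param.requires_grad = "lora_" in name
        if param.requires_grad:
            yield param

def train_step(model, prompts, generate_fn, logprob_fn, reward_fn, optimizer):
    # Rollout: the expensive stage, run without gradients on the quantized policy.
    with torch.no_grad():
        completions = generate_fn(model, prompts)           # FP4 rollout / prefill path
    rewards = reward_fn(prompts, completions)                # tensor of per-sample rewards
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)  # group-relative baseline
    # Update: log-probs computed in higher precision, gradients flow only to LoRA.
    logprobs = logprob_fn(model, prompts, completions)
    loss = -(advantages.detach() * logprobs).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

# optimizer = torch.optim.AdamW(lora_parameters(model), lr=1e-5)
```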

Reported accuracy and memory wins

In experiments on Qwen2.5 models, the team found that NVFP4+LoRA matches or exceeds higher-precision baselines on math reasoning: for a 7B model they report GSM8K = 90.8% and MATH500 = 77.4%, surpassing 16-bit LoRA and QLoRA under their setup and matching full-parameter fine-tuning. The memory savings from weight-only FP4 enabled training a 32B policy with GRPO on a single H100-80GB, the first such demonstration according to the authors.
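A back-of-envelope check of why the weights fit, assuming roughly 32 billion parameters and ignoring KV cache, activations, optimizer state, and quantization scale metadata; the LoRA parameter count is likewise an assumed illustrative figure.

```python
# Illustrative weight-memory arithmetic for a 32B-parameter policy.
params = 32e9
bf16_gib = params * 2 / 2**30    # ~59.6 GiB: little headroom on an 80 GB H100
fp4_gib = params * 0.5 / 2**30   # ~14.9 GiB: weight-only FP4 leaves room for
                                 # KV cache, activations, and optimizer state
lora_gib = 200e6 * 2 / 2**30     # ~0.4 GiB for an assumed ~200M BF16 LoRA params
print(f"BF16 {bf16_gib:.1f} GiB | FP4 {fp4_gib:.1f} GiB | LoRA {lora_gib:.2f} GiB")
```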

Where QeRL applies and its limits

QeRL is weight-only FP4 with LoRA updates and does not claim FP4 precision for logits or gradients. Its main benefits are higher rollout and prefill throughput and a reduced memory footprint, and its observed exploration benefits stem from quantization-induced entropy modulated by AQN. Generalization beyond the reported math-reasoning tasks, or to safety- or tool-use RL, will depend on reward design, sequence lengths, and the availability of NVFP4 kernel support such as Marlin.

Key takeaways

As reported by the authors: weight-only NVFP4 plus LoRA accelerates the rollout and prefill stages while sharply cutting weight memory; quantization-induced entropy acts as an exploration signal that AQN turns into a controllable schedule; a 32B policy can be trained with GRPO on a single H100-80GB; and accuracy on math-reasoning benchmarks matches or exceeds 16-bit LoRA, QLoRA, and full-parameter fine-tuning under their setup. For full technical details, code, notebooks, and reproduction instructions, see the paper and the project's repository linked from the authors' announcement.