
DeepConf: Meta AI's Confidence-Driven Method Hits 99.9% on AIME 2025 with GPT-OSS-120B

Meta AI and UCSD's DeepConf uses token-level confidence to reach 99.9% on AIME 2025 with GPT-OSS-120B while reducing generated tokens by up to 85%, delivering higher accuracy at far lower compute cost.

Why DeepConf?

Large language models have advanced reasoning through techniques like parallel thinking and self-consistency, but those approaches force a trade-off: generating many reasoning traces improves accuracy but dramatically increases compute cost. Deep Think with Confidence (DeepConf), developed by researchers at Meta AI and UCSD, sidesteps that trade-off by using the model's internal confidence to prefer and keep high-quality traces while discarding or early-stopping low-confidence ones. The result is near state-of-the-art accuracy with far fewer generated tokens.

How DeepConf measures and uses confidence

DeepConf introduces several complementary confidence metrics that operate at the token and segment level (each is sketched in code after this list):

  • Token Confidence: the negative mean log-probability of the top-k candidate tokens at each position; higher values indicate a more peaked, more certain local prediction.
  • Group Confidence: averaged token confidence over a sliding window (for example, 2048 tokens) to smooth fluctuations.
  • Tail Confidence: a focused score on the final segment of a trace where the answer is most likely produced.
  • Lowest Group Confidence: identification of the least confident segment to detect collapse points in reasoning.
  • Bottom Percentile Confidence: emphasis on the worst segments, which are most predictive of errors.
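
To make these definitions concrete, here is a minimal NumPy sketch of the five metrics. The function names, the 2048-token window, and the 10th-percentile cutoff are illustrative defaults inferred from the descriptions above, not the authors' reference implementation.

```python
import numpy as np

def token_confidence(topk_logprobs: np.ndarray) -> np.ndarray:
    """Negative mean log-probability of the top-k candidates at each
    position; shape (num_tokens, k) in, (num_tokens,) out.
    Higher values indicate a more certain local prediction."""
    return -topk_logprobs.mean(axis=1)

def group_confidences(token_conf: np.ndarray, window: int = 2048) -> np.ndarray:
    """Sliding-window averages of token confidence (Group Confidence)."""
    if len(token_conf) <= window:
        return np.array([token_conf.mean()])
    kernel = np.ones(window) / window
    return np.convolve(token_conf, kernel, mode="valid")

def tail_confidence(token_conf: np.ndarray, tail: int = 2048) -> float:
    """Average confidence over the final segment of the trace,
    where the answer is usually produced (Tail Confidence)."""
    return float(token_conf[-tail:].mean())

def lowest_group_confidence(token_conf: np.ndarray, window: int = 2048) -> float:
    """The least confident window: a detector for collapse points."""
    return float(group_confidences(token_conf, window).min())

def bottom_percentile_confidence(token_conf: np.ndarray,
                                 window: int = 2048, pct: float = 10.0) -> float:
    """Mean confidence of the worst pct% of windows, the segments
    most predictive of errors."""
    groups = np.sort(group_confidences(token_conf, window))
    n = max(1, int(len(groups) * pct / 100))
    return float(groups[:n].mean())
```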

These signals are used in two main ways: confidence-weighted voting, where higher-confidence traces carry more weight in the final answer, and confidence-based filtering, where only the top η percent of traces by confidence are retained. In online mode, DeepConf can also early-stop a trace as soon as its windowed confidence falls below a dynamic threshold, eliminating wasted token generation.
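
The two offline mechanisms can be sketched in a few lines, assuming each trace has already been reduced to an (answer, confidence) pair using one of the metrics above; the helper names and the eta=0.1 default are illustrative:

```python
from collections import defaultdict

def confidence_weighted_vote(traces: list[tuple[str, float]]) -> str:
    """Each trace votes for its final answer with weight equal to
    its confidence score; the heaviest answer wins."""
    votes = defaultdict(float)
    for answer, conf in traces:
        votes[answer] += conf
    return max(votes, key=votes.get)

def filter_top_eta(traces: list[tuple[str, float]], eta: float = 0.1):
    """Keep only the top eta fraction of traces by confidence,
    e.g. eta=0.1 keeps the most confident 10%."""
    ranked = sorted(traces, key=lambda t: t[1], reverse=True)
    return ranked[:max(1, int(len(ranked) * eta))]

# Filter first, then vote among the survivors:
# best = confidence_weighted_vote(filter_top_eta(traces, eta=0.1))
```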

Key results: accuracy and efficiency

DeepConf was evaluated across reasoning benchmarks such as AIME 2024/2025, HMMT 2025, BRUMO25, and GPQA-Diamond, using models including DeepSeek-8B, Qwen3-8B/32B, and GPT-OSS-20B/120B. Highlights include:

  • GPT-OSS-120B on AIME 2025: accuracy rose from 91.8% (standard pass@1) and 97.0% (consensus@512) to 99.9% with DeepConf@512, while reducing generated tokens by about 84.7%.
  • Across datasets and models, DeepConf improved accuracy by up to roughly 10 percentage points over majority-vote self-consistency and often reached dataset ceilings.
  • Early-stopping low-confidence traces cut generated tokens by about 43% to 85%, with no accuracy loss and frequently with accuracy gains.

Deployment and integration

DeepConf is model-agnostic and operates entirely at inference time. It requires access to token-level log probabilities and a small amount of logic to compute sliding-window confidences and enforce early-stop checks. For vLLM, the integration is minimal: extend the logprobs processor to track windowed confidence, add an early-stop check before emitting outputs, and pass confidence thresholds via the API. The authors report being able to add DeepConf to existing serving stacks with roughly 50 lines of code.
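
The sketch below illustrates the shape of that integration rather than the authors' actual vLLM patch: it wraps a hypothetical generate_stream(prompt) generator assumed to yield (token, top-k logprobs) pairs step by step, and the dynamic threshold the method uses is simplified here to a fixed parameter.

```python
from collections import deque

def generate_with_early_stop(generate_stream, prompt,
                             window: int = 2048,
                             threshold: float | None = None):
    """Stream tokens while tracking windowed confidence; abandon the
    trace as soon as the window fills and confidence drops below
    threshold."""
    recent = deque(maxlen=window)   # rolling window of token confidences
    tokens = []
    for token, topk_logprobs in generate_stream(prompt):
        tokens.append(token)
        recent.append(-sum(topk_logprobs) / len(topk_logprobs))  # token confidence
        if (threshold is not None and len(recent) == window
                and sum(recent) / len(recent) < threshold):
            break  # low-confidence trace: stop early and save tokens
    return tokens
```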

Practical implications

  • Plug-and-play: no model retraining, fine-tuning, or hyperparameter search is required.
  • Cost reduction: dramatically fewer tokens generated translates to lower latency and compute cost in production.
  • Robustness: filtering out low-confidence reasoning traces reduces the impact of degraded or spurious chains of thought that can dilute majority votes.

Further reading

Full technical details, experiments, and implementation notes are available in the paper: https://arxiv.org/pdf/2508.15260 and on the project's GitHub and supporting pages.

FAQs

Q: How does DeepConf improve both accuracy and efficiency compared to majority voting? A: By prioritizing or keeping only higher-confidence traces, DeepConf raises accuracy and uses early stopping to avoid generating wasted low-confidence tokens, yielding both higher performance and lower token usage.

Q: Is DeepConf tied to a specific model or framework? A: No. DeepConf is fully model-agnostic and can be integrated into any serving stack that exposes token log probabilities, including open-source and commercial endpoints.
