
DeepConf: Meta AI's Confidence-Driven Method Hits 99.9% on AIME 2025 with GPT-OSS-120B

Meta AI and UCSD's DeepConf uses token-level confidence to reach 99.9% on AIME 2025 with GPT-OSS-120B while reducing generated tokens by up to 85%, delivering higher accuracy at far lower compute cost.

Why DeepConf?

Large language models have advanced reasoning through techniques like parallel thinking and self-consistency, but those approaches force a trade-off: generating many reasoning traces improves accuracy but dramatically increases compute cost. Deep Think with Confidence (DeepConf), developed by researchers at Meta AI and UCSD, sidesteps that trade-off by using the model's internal confidence to prefer and keep high-quality traces while discarding or early-stopping low-confidence ones. The result is near state-of-the-art accuracy with far fewer generated tokens.

How DeepConf measures and uses confidence

DeepConf introduces several complementary confidence metrics that operate at the token and segment level (each is sketched in code after this list):

  • Token Confidence: the negative mean log-probability of the top-k candidate tokens at each position; higher values indicate a more peaked, more certain local prediction.
  • Group Confidence: averaged token confidence over a sliding window (for example, 2048 tokens) to smooth fluctuations.
  • Tail Confidence: a focused score on the final segment of a trace where the answer is most likely produced.
  • Lowest Group Confidence: identification of the least confident segment to detect collapse points in reasoning.
  • Bottom Percentile Confidence: emphasis on the worst segments, which are most predictive of errors.
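
To make these definitions concrete, here is a minimal NumPy sketch of the five metrics. The function names, the 2048-token window, and the 10th-percentile cutoff are illustrative defaults inferred from the descriptions above, not the authors' reference implementation.

```python
import numpy as np

def token_confidence(topk_logprobs: np.ndarray) -> np.ndarray:
    """Negative mean log-probability of the top-k candidates at each
    position; shape (num_tokens, k) in, (num_tokens,) out.
    Higher values indicate a more certain local prediction."""
    return -topk_logprobs.mean(axis=1)

def group_confidences(token_conf: np.ndarray, window: int = 2048) -> np.ndarray:
    """Sliding-window averages of token confidence (Group Confidence)."""
    if len(token_conf) <= window:
        return np.array([token_conf.mean()])
    kernel = np.ones(window) / window
    return np.convolve(token_conf, kernel, mode="valid")

def tail_confidence(token_conf: np.ndarray, tail: int = 2048) -> float:
    """Average confidence over the final segment of the trace,
    where the answer is usually produced (Tail Confidence)."""
    return float(token_conf[-tail:].mean())

def lowest_group_confidence(token_conf: np.ndarray, window: int = 2048) -> float:
    """The least confident window: a detector for collapse points."""
    return float(group_confidences(token_conf, window).min())

def bottom_percentile_confidence(token_conf: np.ndarray,
                                 window: int = 2048, pct: float = 10.0) -> float:
    """Mean confidence of the worst pct% of windows, the segments
    most predictive of errors."""
    groups = np.sort(group_confidences(token_conf, window))
    n = max(1, int(len(groups) * pct / 100))
    return float(groups[:n].mean())
```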

These signals are used in two main ways: confidence-weighted voting, where higher-confidence traces carry more weight in the final answer, and confidence-based filtering, where only the top η percent of traces by confidence are retained. In online mode, DeepConf can also early-stop a trace as soon as its windowed confidence falls below a dynamic threshold, eliminating wasted token generation.
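
The two offline mechanisms can be sketched in a few lines, assuming each trace has already been reduced to an (answer, confidence) pair using one of the metrics above; the helper names and the eta=0.1 default are illustrative:

```python
from collections import defaultdict

def confidence_weighted_vote(traces: list[tuple[str, float]]) -> str:
    """Each trace votes for its final answer with weight equal to
    its confidence score; the heaviest answer wins."""
    votes = defaultdict(float)
    for answer, conf in traces:
        votes[answer] += conf
    return max(votes, key=votes.get)

def filter_top_eta(traces: list[tuple[str, float]], eta: float = 0.1):
    """Keep only the top eta fraction of traces by confidence,
    e.g. eta=0.1 keeps the most confident 10%."""
    ranked = sorted(traces, key=lambda t: t[1], reverse=True)
    return ranked[:max(1, int(len(ranked) * eta))]

# Filter first, then vote among the survivors:
# best = confidence_weighted_vote(filter_top_eta(traces, eta=0.1))
```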

Key results: accuracy and efficiency

DeepConf was evaluated across reasoning benchmarks such as AIME 2024/2025, HMMT 2025, BRUMO25, and GPQA-Diamond, using models including DeepSeek-8B, Qwen3-8B/32B, and GPT-OSS-20B/120B. Highlights include:

  • GPT-OSS-120B on AIME 2025: accuracy rose from 91.8% (standard pass@1) and 97.0% (consensus@512) to 99.9% with DeepConf@512, while reducing generated tokens by about 84.7%.
  • Across datasets and models, DeepConf improved accuracy by up to roughly 10 percentage points over majority-vote self-consistency and often reached dataset ceilings.
  • Early-stopping low-confidence traces cut generated tokens by about 43% to 85%, with no accuracy loss and frequently with accuracy gains.

Deployment and integration

DeepConf is model-agnostic and operates entirely at inference time. It requires access to token-level log probabilities and a small amount of logic to compute sliding-window confidences and enforce early-stop checks. For vLLM, the integration is minimal: extend the logprobs processor to track windowed confidence, add an early-stop check before emitting outputs, and pass confidence thresholds via the API. The authors report being able to add DeepConf to existing serving stacks with roughly 50 lines of code.
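
The sketch below illustrates the shape of that integration rather than the authors' actual vLLM patch: it wraps a hypothetical generate_stream(prompt) generator assumed to yield (token, top-k logprobs) pairs step by step, and the dynamic threshold the method uses is simplified here to a fixed parameter.

```python
from collections import deque

def generate_with_early_stop(generate_stream, prompt,
                             window: int = 2048,
                             threshold: float | None = None):
    """Stream tokens while tracking windowed confidence; abandon the
    trace as soon as the window fills and confidence drops below
    threshold."""
    recent = deque(maxlen=window)   # rolling window of token confidences
    tokens = []
    for token, topk_logprobs in generate_stream(prompt):
        tokens.append(token)
        recent.append(-sum(topk_logprobs) / len(topk_logprobs))  # token confidence
        if (threshold is not None and len(recent) == window
                and sum(recent) / len(recent) < threshold):
            break  # low-confidence trace: stop early and save tokens
    return tokens
```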

Practical implications

  • Plug-and-play: no model retraining, fine-tuning, or hyperparameter search is required.
  • Cost reduction: dramatically fewer tokens generated translates to lower latency and compute cost in production.
  • Robustness: filtering out low-confidence reasoning traces reduces the impact of degraded or spurious chains of thought that can dilute majority votes.

Further reading

Full technical details, experiments, and implementation notes are available in the paper: https://arxiv.org/pdf/2508.15260 and on the project's GitHub and supporting pages.

FAQs

Q: How does DeepConf improve both accuracy and efficiency compared to majority voting? A: By prioritizing or keeping only higher-confidence traces, DeepConf raises accuracy and uses early stopping to avoid generating wasted low-confidence tokens, yielding both higher performance and lower token usage.

Q: Is DeepConf tied to a specific model or framework? A: No. DeepConf is fully model-agnostic and can be integrated into any serving stack that exposes token log probabilities, including open-source and commercial endpoints.
