OpenAI Launches GPT-5.2: Advanced Agent and Coding Model
Discover the capabilities and benchmarks of OpenAI's new GPT-5.2 model tailored for agents and coding.
Overview of GPT-5.2
OpenAI has just introduced GPT-5.2, its most advanced frontier model for professional work and long-running agents, being rolled out across ChatGPT and the API.
GPT-5.2 consists of three variants: In ChatGPT, users interact with ChatGPT-5.2 Instant, Thinking, and Pro. In the API, users have access to the models gpt-5.2-chat-latest, gpt-5.2, and gpt-5.2-pro. The Instant variant focuses on everyday assistance, Thinking addresses complex multi-step tasks and agents, while Pro allocates greater compute resources for technical and analytical challenges.
Benchmark Profiling with GDPval and SWE Bench
GPT-5.2 Thinking is designed as the primary workhorse for real-world knowledge tasks. Evaluated on GDPval, a benchmark assessing knowledge tasks across various industries, it either beats or ties industry professionals in 70.9% of comparisons. Moreover, it operates at over 11 times the speed of experts while maintaining costs below 1% of their estimated expenses. This reliability enables teams to generate essential artifacts like presentations, spreadsheets, schedules, and diagrams based on structured instructions.
Internal evaluations show significant gains in tallying scores for junior investment banking tasks, moving from 59.1% with GPT-5.1 to 68.4% with GPT-5.2 Thinking and 71.7% with GPT-5.2 Pro.
In software engineering, GPT-5.2 Thinking scores 55.6% on SWE-Bench Pro and 80.0% on SWE-Bench Verified, with the latter concentrating on Python patch generation.
Long Context Capabilities and Agentic Workflows
Long context management is a key advancement in GPT-5.2. The Thinking variant achieves state-of-the-art results on the OpenAI MRCRv2 benchmark, which assesses the model's ability to replicate correct answers within extensive dialogue contexts, reaching near 100% accuracy for queries within 256k tokens.
For extensive workflows exceeding this limit, GPT-5.2 Thinking integrates a Responses /compact endpoint designed for context compaction, particularly useful in maintaining state during multi-step tool processes.
On tool usage, GPT-5.2 Thinking achieves 98.7% on Tau2-bench Telecom, a benchmark where realistic workflows are analyzed, depicting handling situations like rebooking flights.
Enhancements in Vision, Science, and Math
The vision capabilities of GPT-5.2 have improved, halving error rates in chart reasoning benchmarks. Its spatial understanding allows for better image identification, significantly enhancing performance over GPT-5.1.
In scientific evaluations, GPT-5.2 Pro scored 93.2% and GPT-5.2 Thinking 92.4% on GPQA Diamond, showcasing proficiency in graduate-level subjects and complex mathematical proofs.
Comparison of Key Models
| Model | Primary Positioning | Context Window | Knowledge Cutoff | Notable Benchmarks | |--------------------|---------------------------------------------------------|------------------|------------------|-------------------------------------------------------------| | GPT-5.1 | Flagship for coding and agent tasks | 400,000 tokens | 2024-09-30 | SWE-Bench Pro 50.8%, SWE-bench Verified 76.3% | | GPT-5.2 (Thinking) | Main model for coding and long-running agents | 400,000 tokens | 2025-08-31 | GDPval wins 70.9%, SWE-Bench Pro 55.6%, SWE-bench Verified 80.0% | | GPT-5.2 Pro | Enhances reasoning and scientific capabilities | 400,000 tokens | 2025-08-31 | GPQA Diamond 93.2% vs 92.4% for GPT-5.2 Thinking |
Key Takeaways
- GPT-5.2 Thinking is the updated workhorse model: It supersedes GPT-5.1 Thinking for coding and knowledge work with significantly improved benchmark results while maintaining similar context limits.
- Substantial accuracy improvement: With scores rising considerably across various benchmarks, GPT-5.2 Thinking shows marked advancements yet retains comparable token limits.
- GPT-5.2 Pro targets advanced reasoning: Aimed at high-end reasoning tasks, it excels particularly in scientific evaluations, yielding superior results compared to both GPT-5.1 and the Thinking variant.
Сменить язык
Читать эту статью на русском