Grok-4-Fast: xAI's Unified 2M-Token Model With Built-In Tool-Use Reinforcement Learning
xAI released Grok-4-Fast, a single prompt-steerable model with a 2M-token window and tool-use RL that reduces token usage by about 40% while maintaining frontier benchmark performance.
What Grok-4-Fast is
xAI has introduced Grok-4-Fast, a cost-optimized successor to Grok-4 that combines 'reasoning' and 'non-reasoning' behaviors in a single model. Instead of switching between distinct models for short answers and long reasoning chains, Grok-4-Fast uses one weight space and lets system prompts steer its behavior. The model offers a 2 million token context window and native tool-use reinforcement learning that lets it choose when to browse the web, run code, or call external tools.
Unified architecture and prompt steering
Earlier Grok variants separated long-chain reasoning and quick, short-form responses into different models. Grok-4-Fast removes that split by unifying both behaviors in one set of weights. The benefit for real-time applications is twofold: it reduces latency caused by model switching and lowers token consumption, which translates directly into cost savings for high-throughput scenarios like search, interactive coding, and assistive agents. System prompts determine whether the model behaves in a 'reasoning' mode or a 'non-reasoning' mode, making behavior control simpler and faster.
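To make the steering idea concrete, here is a minimal sketch of how a client might select between the two behaviors. It assumes an OpenAI-compatible chat-completions payload and uses the two SKU names mentioned later in this article; the system-prompt wording is illustrative, not xAI's actual steering prompt.

```python
# Sketch: steering Grok-4-Fast between reasoning and non-reasoning
# behavior. Only the system prompt and SKU name change between modes;
# the payload shape assumes an OpenAI-compatible chat endpoint.

def build_request(user_msg: str, reasoning: bool) -> dict:
    """Build a chat payload; the system prompt steers the behavior."""
    system = (
        "Think step by step and use tools when helpful."   # long-chain mode
        if reasoning
        else "Answer concisely without extended reasoning."  # quick mode
    )
    model = "grok-4-fast-reasoning" if reasoning else "grok-4-fast-non-reasoning"
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user_msg},
        ],
    }

quick = build_request("What is the capital of France?", reasoning=False)
deep = build_request("Prove there are infinitely many primes.", reasoning=True)
print(quick["model"])  # grok-4-fast-non-reasoning
print(deep["model"])   # grok-4-fast-reasoning
```

Because both SKUs share one weight space, switching modes is a payload change rather than a model swap, which is where the latency benefit described above comes from.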
Tool-use reinforcement learning and agent performance
Grok-4-Fast was trained end-to-end with tool-use reinforcement learning, teaching the model when to call tools such as web browsing, code execution, or other API endpoints. That training shows up in agent-oriented benchmarks: BrowseComp 44.9%, SimpleQA 95.0%, and Reka Research 66.0%. xAI also reports gains on Chinese variants, for example BrowseComp-zh 51.2%.
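The "decide when to call a tool" behavior can be pictured as a simple agent loop: each turn, the model either emits a tool call for the harness to execute or returns a final answer. The loop below is a generic sketch with a stubbed model; the tool names and response shape are illustrative, not xAI's actual schema.

```python
# Sketch of an agent loop around a tool-using model: per turn the
# model either requests a tool call or answers, and the harness
# executes requested tools and feeds results back into the history.

def run_agent(model_step, tools, question, max_turns=5):
    """Drive a tool-using model until it returns a final answer."""
    history = [{"role": "user", "content": question}]
    for _ in range(max_turns):
        step = model_step(history)            # model picks a tool or answers
        if step["type"] == "final":
            return step["content"]
        result = tools[step["tool"]](step["arguments"])
        history.append({"role": "tool", "name": step["tool"],
                        "content": result})
    return None

# Stub "model" that searches once, then answers with what it found:
def stub_model(history):
    if history[-1]["role"] == "user":
        return {"type": "tool", "tool": "web_search",
                "arguments": "Grok-4-Fast context window size"}
    return {"type": "final", "content": history[-1]["content"]}

tools = {"web_search": lambda query: "2M tokens"}
print(run_agent(stub_model, tools, "How large is the context window?"))
# → 2M tokens
```

In Grok-4-Fast the tool-or-answer decision is learned end-to-end via reinforcement learning rather than hand-coded, which is what the agent benchmarks above are measuring.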
xAI further cites private battle-testing on LMArena: the search-focused grok-4-fast-search (codename 'menlo') ranked #1 in the Search Arena with 1163 Elo, while the text variant (codename 'tahoe') placed #8 in the Text Arena, roughly comparable to grok-4-0709.
Efficiency, benchmarks, and 'intelligence density'
On both internal and public benchmarks Grok-4-Fast posts frontier-class scores while cutting token usage. Reported pass@1 results include 92.0% on AIME 2025 (no tools), 93.3% on HMMT 2025 (no tools), 85.7% on GPQA Diamond, and 80.0% on LiveCodeBench (Jan–May). xAI claims that Grok-4-Fast uses about 40% fewer 'thinking' tokens on average compared to Grok-4, and frames this improvement as higher 'intelligence density'. When combined with new per-token pricing, xAI estimates roughly a 98% reduction in the price required to reach the same benchmark performance as Grok-4.
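The ~98% figure follows from multiplying the two savings. As a back-of-the-envelope check, assume Grok-4's published rate of $15 per 1M output tokens (an assumption for comparison; only Grok-4-Fast's prices are stated in this article) against Grok-4-Fast's $0.50 sub-128k output rate:

```python
# Illustrative arithmetic behind the "~98% cheaper for the same
# performance" claim: ~40% fewer thinking (output) tokens combined
# with a far lower per-token price. Grok-4's $15/1M output rate is
# an assumed comparison point; $0.50/1M is Grok-4-Fast's stated
# sub-128k output price.

grok4_rate = 15.00 / 1_000_000   # $ per output token (assumed)
fast_rate = 0.50 / 1_000_000     # $ per output token (stated)
token_ratio = 0.60               # ~40% fewer thinking tokens

cost_ratio = token_ratio * (fast_rate / grok4_rate)
reduction = 1 - cost_ratio
print(f"cost ratio: {cost_ratio:.3f}")  # 0.020
print(f"reduction:  {reduction:.0%}")   # 98%
```

Under these assumptions the per-benchmark-point cost lands at about 2% of Grok-4's, matching the roughly 98% reduction xAI cites.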
Deployment options and pricing
Grok-4-Fast is generally available in Grok's Fast and Auto modes across web and mobile. In Auto mode, the system will select Grok-4-Fast for difficult queries to improve latency without sacrificing quality. For the first time, free users gain access to xAI's newest model tier.
Developers can access two SKUs: grok-4-fast-reasoning and grok-4-fast-non-reasoning, both offering the 2M-token context window. xAI's API pricing is tiered by context length and token direction: $0.20 per 1M input tokens for contexts under 128k, $0.40 per 1M input tokens for contexts at or above 128k, $0.50 per 1M output tokens for outputs under 128k, $1.00 per 1M output tokens for outputs at or above 128k, and $0.05 per 1M cached input tokens.
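The tiered schedule above can be sketched as a small cost estimator. This is a simplification, assuming the 128k threshold is applied per direction to the token counts of a single request; consult xAI's developer documentation for how thresholds are actually applied.

```python
# Sketch of the tiered Grok-4-Fast API pricing described above:
# input and output rates step up at the 128k threshold, and cached
# input tokens bill at a flat discounted rate. Rates are the
# published per-1M prices; threshold handling is a simplification.

def request_cost(input_tokens: int, output_tokens: int,
                 cached_input_tokens: int = 0) -> float:
    """Estimated USD cost of one request under the tiered schedule."""
    per_million = 1_000_000
    input_rate = 0.20 if input_tokens < 128_000 else 0.40
    output_rate = 0.50 if output_tokens < 128_000 else 1.00
    cached_rate = 0.05
    return (input_tokens * input_rate
            + output_tokens * output_rate
            + cached_input_tokens * cached_rate) / per_million

# A 100k-token prompt with a 2k-token answer stays in the low tier:
print(f"${request_cost(100_000, 2_000):.4f}")  # $0.0210
```

A 200k-token prompt with the same answer would instead bill input at the $0.40 tier, illustrating why long-context workloads should track the 128k boundary.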
Why this matters for search and agentic workflows
The combination of a very large context window, prompt-steerable unified behavior, and tool-use RL positions Grok-4-Fast for high-throughput search and agent tasks. Reductions in both latency and token usage make it attractive for production deployments where per-query cost and responsiveness are critical. Early public signals such as the LMArena placements, together with the benchmark profile, suggest xAI has matched Grok-4's accuracy with materially lower per-query token consumption.
Technical highlights
- Unified model with a 2M-token context window across both SKUs.
- End-to-end training with tool-use reinforcement learning to decide when to call tools.
- Reported efficiency of roughly 40% fewer thinking tokens versus Grok-4.
- API pricing designed for scale with cached input token discounts.
Grok-4-Fast packages Grok-4-level capabilities into a single, prompt-steerable model optimized for search and agent workloads, aiming to lower latency and unit costs in production environments. For more details consult xAI's technical notes and developer documentation.