UltraCUA: Hybrid Action Agents that Mix GUI Clicks with Programmatic Tools
What UltraCUA does
UltraCUA is a foundation model for computer-use agents that combines low-level GUI primitives (clicks, keystrokes, scrolls) with high-level programmatic tool calls. Instead of forcing agents to complete long chains of primitive GUI actions, UltraCUA provides a hybrid action space in which a single tool call, exposed as a callable interface with a clear signature and docstring, encapsulates a multi-step operation. When a programmatic route is available and cheaper or more reliable, the agent calls the tool; when none exists, it falls back to GUI actions.
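To make the idea concrete, here is a minimal sketch of what such a hybrid action space can look like in code. The function names and print-based stubs are illustrative assumptions for this article, not UltraCUA's actual interface.

```python
from typing import Callable

# --- Low-level GUI primitives (illustrative stubs) -------------------------
def click(x: int, y: int) -> None:
    """Click at absolute screen coordinates."""
    print(f"click({x}, {y})")

def hotkey(*keys: str) -> None:
    """Press a keyboard shortcut, e.g. hotkey('ctrl', 's')."""
    print(f"hotkey{keys}")

# --- High-level programmatic tool (illustrative) ----------------------------
def writer_export_pdf(path: str) -> None:
    """Export the current LibreOffice Writer document to a PDF at `path`.

    One tool call stands in for the multi-step GUI sequence
    (File menu -> Export As -> Export as PDF -> type path -> confirm).
    """
    print(f"export_pdf -> {path}")

# Both kinds of actions sit in one registry the policy chooses from.
ACTION_SPACE: dict[str, Callable] = {
    "click": click,
    "hotkey": hotkey,
    "writer_export_pdf": writer_export_pdf,
}
```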
Hybrid action space and motivation
Treating tools as first-class actions reduces the cascading errors that accumulate over long sequences of primitive actions. A tool call acts as a single atomic step that hides complex GUI navigation, while clicks and key presses remain available for tasks with no programmatic pathway. The model learns to alternate between both modes, choosing the most reliable and cost-effective action at each decision point.
Scaling tool acquisition
UltraCUA builds a large reusable tool library with an automated pipeline. The system extracts keyboard shortcuts and commands from software documentation, integrates open-source tool implementations from agent toolkits, and uses coding agents to synthesize new tools. Each tool is wrapped as a callable interface that replaces a long GUI sequence. The team reports coverage across 10 desktop domains with 881 tools in total, including 135 tools for VS Code and 123 for LibreOffice Writer, plus deep coverage for Thunderbird and GIMP.
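The wrapping step can be pictured roughly as follows. The registry, decorator, and stub keyboard backend are assumptions made for this sketch rather than the paper's pipeline code; only the idea of turning documented shortcuts into described, callable tools comes from the description above.

```python
from typing import Callable

# Hypothetical tool library keyed by application domain.
TOOL_LIBRARY: dict[str, dict[str, Callable]] = {}

def register(domain: str):
    """Add a callable tool to the library under `domain`."""
    def wrap(fn: Callable) -> Callable:
        TOOL_LIBRARY.setdefault(domain, {})[fn.__name__] = fn
        return fn
    return wrap

def press_hotkey(*keys: str) -> None:
    """Stand-in for the GUI backend that actually sends keystrokes."""
    print(f"hotkey: {'+'.join(keys)}")

# Tools distilled from documented keyboard shortcuts: the docstring is the
# description the model reads when deciding whether to call the tool.
@register("vscode")
def toggle_integrated_terminal() -> None:
    """Show or hide the VS Code integrated terminal (Ctrl+`)."""
    press_hotkey("ctrl", "`")

@register("libreoffice_writer")
def apply_heading_1() -> None:
    """Apply the Heading 1 paragraph style in LibreOffice Writer (Ctrl+1)."""
    press_hotkey("ctrl", "1")
```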
Synthetic tasks and verifiable trajectories
Training required grounded supervision and stable rewards, so the researchers designed a dual synthetic engine. One pipeline composes atomic verifiers for browsers, files, images, and system state and then generates tasks that satisfy those checks. The other pipeline explores the OS to propose context-aligned tasks that are then verified. This process yields 17,864 verifiable tasks spanning 10 domains, including Chrome, LibreOffice, GIMP, VS Code, Thunderbird, VLC, and multi-application workflows. For example, Chrome accounts for 2,826 tasks and the LibreOffice suite totals 5,885, with 2,113 multi-app tasks.
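Here is a rough sketch of how atomic verifiers could compose into a task-level check, which is what makes the reward automatically computable. The verifier names, paths, and task format are illustrative assumptions, not the paper's implementation.

```python
import os
from typing import Callable

Verifier = Callable[[], bool]

# --- Atomic verifiers (illustrative examples) -------------------------------
def file_exists(path: str) -> Verifier:
    return lambda: os.path.exists(path)

def file_contains(path: str, needle: str) -> Verifier:
    def check() -> bool:
        try:
            with open(path, encoding="utf-8") as f:
                return needle in f.read()
        except OSError:
            return False
    return check

def all_of(*verifiers: Verifier) -> Verifier:
    """Compose atomic checks into one task-level verifier."""
    return lambda: all(v() for v in verifiers)

# A synthetic task pairs an instruction with its composed verifier, giving
# the training pipeline a binary, automatically checkable success signal.
task = {
    "instruction": "Create notes.md on the desktop containing the word 'UltraCUA'.",
    "verifier": all_of(
        file_exists("/home/user/Desktop/notes.md"),
        file_contains("/home/user/Desktop/notes.md", "UltraCUA"),
    ),
}
reward = 1.0 if task["verifier"]() else 0.0
```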
Rollouts and supervised data
A multi-agent rollout process produces successful hybrid trajectories. The planner uses OpenAI o3 for decision making and the grounder relies on GTA1-7B for visual localization. The rollout produced approximately 26.8K successful trajectories that demonstrate when to call a tool versus when to operate directly on the GUI. These trajectories form the core supervised dataset for the first training stage.
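In outline, a rollout loop of this kind might look like the sketch below. The planner and grounder interfaces and the environment methods are assumptions; only the division of labor, a planner for decisions and a grounder for visual localization, plus a verifier filter on the result, comes from the description above.

```python
def run_rollout(task, planner, grounder, env, max_steps: int = 15):
    """Collect one hybrid trajectory: the planner picks the next action,
    the grounder resolves GUI targets to coordinates, and tool calls skip
    grounding entirely (all interfaces are hypothetical)."""
    trajectory = []
    obs = env.reset(task["instruction"])
    for _ in range(max_steps):
        # Planner (a reasoning model) decides: stop, call a tool, or act on the GUI.
        action = planner.decide(task["instruction"], obs, trajectory)
        if action["type"] == "done":
            break
        if action["type"] == "tool_call":
            obs = env.call_tool(action["name"], **action["args"])
        else:  # GUI primitive
            # Grounder (a visual localization model) maps the referenced
            # element description to screen coordinates.
            x, y = grounder.locate(obs.screenshot, action["target"])
            obs = env.gui_step(action["primitive"], x, y)
        trajectory.append((action, obs))
    # Keep only trajectories that pass the task's verifier for supervision.
    return trajectory if task["verifier"]() else None
```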
Two-stage training pipeline
Training proceeds in two stages. Stage 1 is supervised fine-tuning on successful hybrid trajectories: the models train for three epochs with a learning rate of 2e-5 and use turn-wise loss to avoid overweighting early steps. Stage 2 is online reinforcement learning: models train for 150 steps at a learning rate of 1e-6 on verified tasks sampled by difficulty. Policy optimization follows a GRPO variant with a higher clip, and the approach removes KL regularization and format rewards. The reward combines sparse task outcome signals with a tool-use term. Experiments ran on NVIDIA H100 GPUs and the context window was held near 32K tokens by limiting the number of exposed tools.
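The RL objective can be pictured with the sketch below: group-normalized advantages, ratio clipping with a looser upper bound, no KL term, and a reward that adds a small tool-use term to the sparse outcome signal. The specific bounds and weights are illustrative assumptions, not the paper's values.

```python
import torch

def trajectory_reward(task_success: bool, tools_called: int,
                      tools_succeeded: int, tool_weight: float = 0.1) -> float:
    """Sparse verifier-based outcome plus a small tool-use term
    (the 0.1 weight and the ratio form are illustrative)."""
    outcome = 1.0 if task_success else 0.0
    tool_term = tool_weight * (tools_succeeded / tools_called) if tools_called else 0.0
    return outcome + tool_term

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages over rollouts of the same task.
    `rewards` has shape [num_tasks, rollouts_per_task]."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True).clamp_min(1e-6)
    return (rewards - mean) / std

def clipped_policy_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                        adv: torch.Tensor, clip_low: float = 0.2,
                        clip_high: float = 0.28) -> torch.Tensor:
    """PPO-style ratio clipping with a looser upper bound and no KL penalty
    (the 0.2 / 0.28 bounds are placeholders, not the paper's numbers)."""
    ratio = (logp_new - logp_old).exp()
    unclipped = ratio * adv
    clipped = ratio.clamp(1.0 - clip_low, 1.0 + clip_high) * adv
    return -torch.minimum(unclipped, clipped).mean()
```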
Results on OSWorld and efficiency gains
UltraCUA shows consistent gains over GUI-only and other baselines at both 7B and 32B scales. Under a 15-step budget, UltraCUA-32B reached 41.0% success versus 29.7% for OpenCUA-32B, an 11.3 percentage point absolute improvement. UltraCUA-7B reached 28.9%, compared with 23.4% for UI-TARS-1.5-7B. Gains persist under larger step budgets and across domains such as Chrome, Writer, and VS Code. Average step counts also drop relative to baselines, indicating better action selection rather than simply more attempts.
Cross-platform generalization
Although trained only on Ubuntu-based OSWorld data, UltraCUA transfers to WindowsAgentArena in a zero-shot fashion. UltraCUA-7B achieves 21.7% success on WindowsAgentArena, outperforming UI-TARS-1.5-7B at 18.1% and a Qwen2 baseline trained on Windows data at 13.5%. This suggests the hybrid action policy generalizes across operating systems.
Why this matters
UltraCUA formalizes a practical bridge between general-purpose GUI agents and specialized API-centric agents by letting a single policy interleave programmatic tool calls with GUI primitives. The combination of a scalable, automated tool library and a verifiable synthetic task engine enables grounded supervised and reinforcement learning at scale, which yields measurable improvements in reliability and efficiency on desktop automation benchmarks.