
VeBrain: Revolutionizing Robotics with Unified Multimodal AI for Vision and Control

VeBrain introduces a unified multimodal AI framework that combines visual reasoning and robotic control, achieving superior performance in complex real-world robotic tasks.

Bridging Perception and Action in Robotics

Multimodal Large Language Models (MLLMs) offer significant potential for empowering robots—such as robotic arms and legged robots—to perceive their environment, understand complex scenarios, and carry out meaningful actions. Integrating such intelligence into physical machines is pushing robotics toward autonomous systems that not only see and describe but also plan and navigate their surroundings with contextual awareness.

Challenges in Combining Vision, Reasoning, and Physical Control

Despite advancements in MLLMs, a major challenge remains: effectively merging vision, reasoning, and physical interaction into a single coherent system. Traditionally, models trained to interpret images or text struggle with controlling robots in real-world environments. Understanding a scene differs fundamentally from acting within it. While multimodal understanding emphasizes perception and analysis, physical control demands precise, real-time decisions based on that understanding. This gap limits the creation of agents capable of observing, reasoning, and acting seamlessly in diverse contexts.

Limitations of Existing Vision-Language-Action Models

Previous approaches to robot control rely on vision-language-action (VLA) models trained on large datasets to map visual input to control commands. Some attempt to preserve MLLM reasoning by translating commands into text-based actions, but accuracy and adaptability often degrade during execution. VLAs also tend to falter on long-horizon or varied robotic tasks and struggle to generalize across different robots and environments, because image-based comprehension and motor control remain disconnected.

Introducing VeBrain: A Unified Multimodal Framework

Researchers from Shanghai AI Laboratory, Tsinghua University, and SenseTime Research, alongside other collaborators, have developed VeBrain (Visual Embodied Brain), a unified framework that reformulates robot control as text-based tasks within a 2D visual context, aligning closely with MLLM operations. VeBrain integrates multimodal perception, spatial reasoning, and robotic control into a single architecture. A specialized robotic adapter converts MLLM outputs into executable movement policies, enabling a single model to handle perception, reasoning, and control collectively.
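
To make the idea concrete, here is a minimal sketch of what "control as a text task over the 2D image" could look like in practice: the model is prompted to answer with a skill name and a 2D keypoint, and a small parser turns that text into a structured action. The prompt wording, the JSON schema, and the parse_action helper are illustrative assumptions, not VeBrain's actual interface.

```python
# Hedged sketch: phrasing robot control as a text task over the 2D image plane.
# Prompt wording, output schema, and helper names are assumptions for illustration.
import json
import re
from dataclasses import dataclass


@dataclass
class KeypointAction:
    skill: str                  # e.g. "grasp", "turn", "move_to"
    keypoint: tuple[int, int]   # target pixel (x, y) in the current camera frame


PROMPT_TEMPLATE = (
    "You are controlling a robotic arm. Given the camera image, "
    "reason about the scene, then answer with JSON of the form "
    '{{"skill": <skill name>, "keypoint": [x, y]}}.\n'
    "Instruction: {instruction}"
)


def parse_action(mllm_text: str) -> KeypointAction:
    """Extract a structured action from the model's free-form text reply."""
    match = re.search(r"\{.*\}", mllm_text, re.DOTALL)
    if match is None:
        raise ValueError("no JSON action found in model output")
    payload = json.loads(match.group(0))
    x, y = payload["keypoint"]
    return KeypointAction(skill=payload["skill"], keypoint=(int(x), int(y)))


if __name__ == "__main__":
    # A reply like this is the kind of text a VeBrain-style MLLM might produce.
    reply = 'The red cup sits left of the bowl. {"skill": "grasp", "keypoint": [412, 268]}'
    print(PROMPT_TEMPLATE.format(instruction="Pick up the red cup."))
    print(parse_action(reply))
```

Because the action lives entirely in text and 2D image coordinates, the same MLLM machinery used for captioning or visual question answering can produce it without architectural changes.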

VeBrain is supported by VeBrain-600k, a high-quality instruction dataset containing over 600,000 multimodal task samples, including robot motions and reasoning steps.
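
For illustration only, a VeBrain-600k-style sample might pair an image with an instruction, intermediate reasoning steps, and a text-encoded motion, along the lines of the sketch below; the field names and values are invented for clarity and are not taken from the released dataset.

```python
# Hypothetical illustration of how one instruction sample could bundle an image,
# an instruction, reasoning steps, and a text-encoded robot motion.
sample = {
    "image": "frames/kitchen_0001.jpg",            # hypothetical path
    "instruction": "Place the apple into the bowl.",
    "reasoning": [
        "The apple is on the left edge of the counter.",
        "The bowl is empty and within reach of the arm.",
    ],
    "action": {"skill": "grasp", "keypoint": [512, 301]},  # text-encoded motion
}

print(sample["action"])
```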

Architecture and Robotic Adapter Components

VeBrain builds on the Qwen2.5-VL architecture, enhanced with real-world control capabilities. Its robotic adapter features four main modules:

  • Point Tracker: Updates 2D keypoints dynamically as the robot's viewpoint shifts, ensuring precise targeting.
  • Movement Controller: Converts 2D keypoints into 3D movements by integrating image data with depth information.
  • Skill Executor: Maps predicted commands like "turn" or "grasp" to pre-trained robotic skills.
  • Dynamic Takeover Module: Detects errors or anomalies, reverting control to the MLLM when necessary.

These components create a closed-loop system that enables continuous decision-making, action, and self-correction, allowing robots to operate effectively in complex and varied environments.
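
The sketch below shows one way such a closed loop could be wired together, assuming a simple pinhole camera model and stub functions for the tracker, depth lookup, skill library, and MLLM call; none of these names come from the VeBrain codebase.

```python
# Hedged sketch of the adapter's closed loop under simplifying assumptions:
# the tracker, depth lookup, skill library, and MLLM call are stand-in stubs.
import numpy as np

# Hypothetical pinhole intrinsics (fx, fy, cx, cy) for the robot's camera.
FX, FY, CX, CY = 600.0, 600.0, 320.0, 240.0


def backproject(u: int, v: int, depth_m: float) -> np.ndarray:
    """Movement-controller step: lift a 2D keypoint to a 3D camera-frame point."""
    x = (u - CX) * depth_m / FX
    y = (v - CY) * depth_m / FY
    return np.array([x, y, depth_m])


def control_loop(mllm_plan, tracker, depth_at, skills, max_steps=20):
    """Track keypoint -> back-project to 3D -> execute skill, with takeover on failure."""
    plan = mllm_plan()                      # initial (skill, keypoint) from the MLLM
    for _ in range(max_steps):
        skill, (u, v) = plan
        u, v = tracker(u, v)                # point tracker updates the 2D keypoint
        target = backproject(u, v, depth_at(u, v))
        ok = skills[skill](target)          # skill executor runs a pre-trained policy
        if ok:
            return True
        plan = mllm_plan()                  # dynamic takeover: hand control back to the MLLM
    return False


if __name__ == "__main__":
    # Toy stand-ins so the loop runs end to end.
    done = control_loop(
        mllm_plan=lambda: ("grasp", (412, 268)),
        tracker=lambda u, v: (u, v),
        depth_at=lambda u, v: 0.55,
        skills={"grasp": lambda p: bool(p[2] > 0)},
    )
    print("task completed:", done)
```

Keeping the MLLM out of the inner loop and re-querying it only when a skill fails is one plausible reading of the dynamic takeover behavior described above, and it keeps low-level control fast while preserving high-level reasoning when things go wrong.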

Performance Across Benchmarks

VeBrain was tested on 13 multimodal and 5 spatial benchmarks, achieving notable improvements over prior models. It outperformed Qwen2.5-VL by 5.6% on MMVet, scored 101.5 on the CIDEr metric for ScanQA, and achieved 83.7 on MMBench. On the VSI benchmark, VeBrain averaged 39.9, surpassing Qwen2.5-VL’s 35.9.

In robotic tasks, VeBrain achieved an 86.4% success rate across seven legged-robot tasks, far ahead of VLA and π0 models, which scored 32.1% and 31.4%, respectively. On robotic arm tasks, it reached a 74.3% success rate, outperforming other methods by up to 80%. These results demonstrate VeBrain's ability to handle long-horizon and spatially complex robot control tasks reliably.

Advancing Embodied AI

VeBrain redefines robot control as a language-driven task, seamlessly combining high-level reasoning with low-level action. This approach bridges the gap between visual understanding and physical execution, offering a functional and scalable solution. With its robust design and impressive results, VeBrain marks a significant step towards unified, intelligent robotic systems capable of autonomous operation across diverse environments and tasks.

For more details, check out the Paper and GitHub Page.
