Vision-Language-Action Models for Robotics

The field of robotics is moving toward generalist robots that can perform a wide range of tasks. Recent research has focused on vision-language-action (VLA) models, which have shown strong generalization in action prediction. Current work improves these models by incorporating multi-sensor perception, pixel-level understanding, and physically-grounded spatial intelligence. Notable advances include models that learn from minimal real-world experience, transfer to new embodiments, and achieve higher task success rates. Several papers also explore latent action models, dual-level action representation frameworks, and unified vision-motion representations to improve VLA performance. Overall, the field is advancing toward more versatile and scalable VLA learning across diverse robots, tasks, and environments.

Noteworthy papers include:

Maestro, which introduces a VLM coding agent that dynamically composes robotics modules into a programmatic policy for the current task and scenario.

OmniVLA, which integrates novel sensing modalities for physically-grounded spatial intelligence beyond RGB perception.

LACY, which learns bidirectional language-action mappings within a single vision-language model, enabling a self-improving cycle that autonomously generates and filters new training data.

XR-1, which introduces a discrete latent representation learned via a dual-branch VQ-VAE that jointly encodes visual dynamics and robotic motion.
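To make the XR-1 idea concrete, the sketch below shows what a dual-branch VQ-VAE with a shared discrete codebook might look like: one encoder for visual dynamics features and one for robot motion, both quantized against the same codebook so that vision and motion map into a unified token space. This is a minimal illustrative sketch; the class names, layer sizes, and feature dimensions are assumptions, not the authors' implementation.

```python
# Hypothetical dual-branch VQ-VAE sketch (not the XR-1 reference code).
import torch
import torch.nn as nn


class SharedCodebook(nn.Module):
    """Quantizes continuous features to the nearest entry of a shared codebook."""

    def __init__(self, num_codes=512, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):
        # z: (batch, dim); pick the nearest code by Euclidean distance.
        dists = torch.cdist(z, self.codebook.weight)   # (batch, num_codes)
        idx = dists.argmin(dim=-1)                     # discrete token ids
        z_q = self.codebook(idx)
        # Straight-through estimator so gradients reach the encoder.
        z_q = z + (z_q - z).detach()
        return z_q, idx


class DualBranchVQVAE(nn.Module):
    """Two encoders (visual dynamics, robot motion) sharing one discrete codebook."""

    def __init__(self, vis_dim=2048, motion_dim=14, latent_dim=256):
        super().__init__()
        self.vis_enc = nn.Sequential(nn.Linear(vis_dim, 512), nn.ReLU(),
                                     nn.Linear(512, latent_dim))
        self.motion_enc = nn.Sequential(nn.Linear(motion_dim, 128), nn.ReLU(),
                                        nn.Linear(128, latent_dim))
        self.quant = SharedCodebook(dim=latent_dim)
        self.vis_dec = nn.Linear(latent_dim, vis_dim)
        self.motion_dec = nn.Linear(latent_dim, motion_dim)

    def forward(self, vis_feat, motion):
        # Both branches are quantized with the same codebook, yielding
        # unified vision-motion tokens plus per-branch reconstructions.
        zv_q, vis_tokens = self.quant(self.vis_enc(vis_feat))
        zm_q, motion_tokens = self.quant(self.motion_enc(motion))
        return self.vis_dec(zv_q), self.motion_dec(zm_q), vis_tokens, motion_tokens
```

The shared codebook is the key design choice here: because both branches emit tokens from the same vocabulary, a downstream VLA policy can be trained to predict these tokens regardless of whether they originated from visual dynamics or robot motion.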

Sources

Maestro: Orchestrating Robotics Modules with Vision-Language Models for Zero-Shot Generalist Robots

Embodiment Transfer Learning for Vision-Language-Action Models

OmniVLA: Unifying Multi-Sensor Perception for Physically-Grounded Multimodal VLA

PixelVLA: Advancing Pixel-level Understanding in Vision-Language-Action Model

iFlyBot-VLA Technical Report

LACY: A Vision-Language Model-based Language-Action Cycle for Self-Improving Robotic Manipulation

XR-1: Towards Versatile Vision-Language-Action Models via Learning Unified Vision-Motion Representations
