Vision-Language-Action Models for Robotics

The field of robotics is moving toward generalist robots that can perform a wide range of tasks. Recent research has focused on vision-language-action (VLA) models, which have shown strong generalization in action prediction. Current work improves these models by incorporating multi-sensor perception, pixel-level understanding, and physically-grounded spatial intelligence. Notable advances include models that learn from minimal real-world experience, transfer to new embodiments, and achieve higher task success rates. Several papers also explore latent action models, dual-level action representation frameworks, and unified vision-motion representations to improve VLA performance. Overall, the field is advancing toward more versatile and scalable VLA learning across diverse robots, tasks, and environments.

Noteworthy papers include:

Maestro, which introduces a VLM coding agent that dynamically composes robotics modules into a programmatic policy for the current task and scenario.

OmniVLA, which integrates novel sensing modalities for physically-grounded spatial intelligence beyond RGB perception.

LACY, which learns bidirectional language-action mappings within a single vision-language model, enabling a self-improving cycle that autonomously generates and filters new training data.

XR-1, which introduces a discrete latent representation learned via a dual-branch VQ-VAE that jointly encodes visual dynamics and robotic motion.
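To make the XR-1 idea concrete, the sketch below shows what a dual-branch VQ-VAE with a shared discrete codebook might look like: one encoder for visual dynamics features and one for robot motion, both quantized against the same codebook so that vision and motion map into a unified token space. This is a minimal illustrative sketch; the class names, layer sizes, and feature dimensions are assumptions, not the authors' implementation.

```python
# Hypothetical dual-branch VQ-VAE sketch (not the XR-1 reference code).
import torch
import torch.nn as nn


class SharedCodebook(nn.Module):
    """Quantizes continuous features to the nearest entry of a shared codebook."""

    def __init__(self, num_codes=512, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):
        # z: (batch, dim); pick the nearest code by Euclidean distance.
        dists = torch.cdist(z, self.codebook.weight)   # (batch, num_codes)
        idx = dists.argmin(dim=-1)                     # discrete token ids
        z_q = self.codebook(idx)
        # Straight-through estimator so gradients reach the encoder.
        z_q = z + (z_q - z).detach()
        return z_q, idx


class DualBranchVQVAE(nn.Module):
    """Two encoders (visual dynamics, robot motion) sharing one discrete codebook."""

    def __init__(self, vis_dim=2048, motion_dim=14, latent_dim=256):
        super().__init__()
        self.vis_enc = nn.Sequential(nn.Linear(vis_dim, 512), nn.ReLU(),
                                     nn.Linear(512, latent_dim))
        self.motion_enc = nn.Sequential(nn.Linear(motion_dim, 128), nn.ReLU(),
                                        nn.Linear(128, latent_dim))
        self.quant = SharedCodebook(dim=latent_dim)
        self.vis_dec = nn.Linear(latent_dim, vis_dim)
        self.motion_dec = nn.Linear(latent_dim, motion_dim)

    def forward(self, vis_feat, motion):
        # Both branches are quantized with the same codebook, yielding
        # unified vision-motion tokens plus per-branch reconstructions.
        zv_q, vis_tokens = self.quant(self.vis_enc(vis_feat))
        zm_q, motion_tokens = self.quant(self.motion_enc(motion))
        return self.vis_dec(zv_q), self.motion_dec(zm_q), vis_tokens, motion_tokens
```

The shared codebook is the key design choice here: because both branches emit tokens from the same vocabulary, a downstream VLA policy can be trained to predict these tokens regardless of whether they originated from visual dynamics or robot motion.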

Sources

Maestro: Orchestrating Robotics Modules with Vision-Language Models for Zero-Shot Generalist Robots

Embodiment Transfer Learning for Vision-Language-Action Models

OmniVLA: Unifying Multi-Sensor Perception for Physically-Grounded Multimodal VLA

PixelVLA: Advancing Pixel-level Understanding in Vision-Language-Action Model

iFlyBot-VLA Technical Report

LACY: A Vision-Language Model-based Language-Action Cycle for Self-Improving Robotic Manipulation

XR-1: Towards Versatile Vision-Language-Action Models via Learning Unified Vision-Motion Representations
