Advances in Vision-Language-Action Models for Robotic Manipulation

The field of Vision-Language-Action (VLA) models is advancing rapidly, with a focus on improving robots' ability to understand and execute complex tasks. Recent work centers on strengthening generalization and fine-grained control, and on making models robust across diverse tasks, scenes, and camera viewpoints. Progress spans discrete and continuized discrete diffusion frameworks for action generation, mixture-of-horizons strategies for action chunking, unified vision-language-action frameworks for more effective and efficient manipulation, and new hardware designs such as the hybrid suction-and-gripping end effector in VacuumVLA, which expands the range of tasks VLA models can feasibly perform.

Several papers stand out. QuickLAP introduces a Bayesian framework that fuses physical and language feedback to infer reward functions in real time. Mixture of Horizons addresses the trade-off in action chunking where longer horizons provide stronger global foresight but shorter horizons preserve fine-grained accuracy (a minimal sketch of this idea follows below). MobileVLA-R1 presents a unified vision-language-action framework for quadruped robots, and MergeVLA introduces a merging-oriented VLA architecture that preserves mergeability by design, enabling cross-skill model merging toward a generalist agent.
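
The mixture-of-horizons trade-off can be made concrete with a small sketch. The snippet below is an illustration only, not the paper's actual method: the function `mix_horizon_chunks` and the fixed per-horizon weights are assumptions, standing in for whatever weighting or scheduling the paper uses to combine action chunks of different lengths.

```python
import numpy as np

def mix_horizon_chunks(chunks, weights):
    """Blend overlapping action chunks predicted at different horizons.

    chunks:  list of (T_i, action_dim) arrays, one chunk per horizon.
    weights: per-horizon scalar weights (hand-set here; a real system
             might learn or schedule them).
    Returns a (max_T, action_dim) array: at each timestep, the weighted
    average of every chunk that covers that step.
    """
    max_t = max(c.shape[0] for c in chunks)
    dim = chunks[0].shape[1]
    blended = np.zeros((max_t, dim))
    total_w = np.zeros((max_t, 1))
    for chunk, w in zip(chunks, weights):
        t = chunk.shape[0]
        blended[:t] += w * chunk   # each chunk votes on the steps it covers
        total_w[:t] += w
    return blended / total_w       # normalize by the weight present per step

# Toy example: a 4-step chunk (fine-grained control) and a 16-step
# chunk (global foresight) over a 7-DoF action space.
rng = np.random.default_rng(0)
short = rng.normal(size=(4, 7))
long = rng.normal(size=(16, 7))
actions = mix_horizon_chunks([short, long], weights=[0.7, 0.3])
print(actions.shape)  # (16, 7)
```

In this toy version, the short chunk dominates the near-term steps it covers, where fine-grained accuracy matters most, while only the long chunk contributes beyond that point, supplying the global foresight; the hand-set weights are a placeholder for whatever combination strategy the paper actually proposes.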

Sources

QuickLAP: Quick Language-Action Preference Learning for Autonomous Driving Agents

MobileVLA-R1: Reinforcing Vision-Language-Action for Mobile Robots

Continually Evolving Skill Knowledge in Vision Language Action Model

Skypilot: Fine-Tuning LLM with Physical Grounding for AAV Coverage Search

Weakly-supervised Latent Models for Task-specific Visual-Language Control

MergeVLA: Cross-Skill Model Merging Toward a Generalist Vision-Language-Action Agent

SENTINEL: A Fully End-to-End Language-Action Model for Humanoid Whole Body Control

Mixture of Horizons in Action Chunking

MAPS: Preserving Vision-Language Representations via Module-Wise Proximity Scheduling for Better Vision-Language-Action Generalization

Complex Instruction Following with Diverse Style Policies in Football Games

Reinforcing Action Policies by Prophesying

$\mathcal{E}_0$: Enhancing Generalization and Fine-Grained Control in VLA Models via Continuized Discrete Diffusion

VacuumVLA: Boosting VLA Capabilities via a Unified Suction and Gripping Tool for Complex Robotic Manipulation
