The field of vision-language-action models is advancing rapidly, with a focus on improving robustness and generalization in real-world scenarios. Recent work integrates large language models and multimodal models to strengthen the reasoning and decision-making capabilities of autonomous systems. Semantic world models, which predict task-relevant semantic information about the future, have shown promise for improving planning decisions, while hierarchical models that decouple semantic planning from embodiment grounding enable more efficient and effective navigation across diverse environments.

Noteworthy papers include NavQ, which builds a foresighted vision-and-language navigation agent by training a Q-model with Q-learning, and VAMOS, which proposes a hierarchical vision-language-action model for capability-modulated and steerable navigation. LaViRA is also notable for its zero-shot vision-language navigation framework, which leverages the strengths of multimodal large language models at different scales. Together, these advances stand to improve the performance and reliability of autonomous systems across a wide range of applications.
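The foresighted-agent idea can be pictured with a minimal sketch: a learned Q-model scores candidate navigation actions by their predicted future value, and those scores are blended with the base policy's instruction-conditioned logits before an action is chosen. This is an illustration under assumed interfaces, not NavQ's actual architecture; the names QModel and select_action, the feature dimensions, and the blending rule are all hypothetical.

```python
# Hypothetical sketch: Q-model foresight for vision-and-language navigation.
# A small network scores (observation, instruction, candidate action) triples,
# and its scores are mixed with the base policy's logits. Names, shapes, and
# the blending scheme are illustrative assumptions, not taken from the paper.
import torch
import torch.nn as nn


class QModel(nn.Module):
    """Estimates future task progress for each candidate action;
    assumed to be trained offline with Q-learning."""

    def __init__(self, obs_dim: int, instr_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + instr_dim + act_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, instr, act):
        # obs: (N, obs_dim), instr: (N, instr_dim), act: (N, act_dim)
        return self.net(torch.cat([obs, instr, act], dim=-1)).squeeze(-1)


def select_action(policy_logits, q_values, alpha: float = 0.5):
    """Blend base-policy logits with Q-model foresight scores and
    return the index of the highest-scoring candidate."""
    combined = (1 - alpha) * policy_logits + alpha * q_values
    return torch.argmax(combined, dim=-1)


# Toy usage with random features for 4 candidate waypoints.
obs_dim, instr_dim, act_dim, n_cand = 64, 32, 16, 4
q_model = QModel(obs_dim, instr_dim, act_dim)
obs = torch.randn(n_cand, obs_dim)                     # per-candidate visual features
instr = torch.randn(1, instr_dim).expand(n_cand, -1)   # shared instruction embedding
acts = torch.randn(n_cand, act_dim)                    # candidate action embeddings
policy_logits = torch.randn(n_cand)                    # base policy scores
with torch.no_grad():
    q_values = q_model(obs, instr, acts)
choice = select_action(policy_logits, q_values)
print(f"Selected candidate index: {choice.item()}")
```

The same candidate-scoring interface could sit under a hierarchical model in which a high-level semantic planner proposes subgoals and an embodiment-specific controller executes them; that decoupling is described only at a high level in the summary above, so it is not sketched here.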
Advances in Vision-Language-Action Models for Robotics and Autonomous Systems
Sources
Do What You Say: Steering Vision-Language-Action Models via Runtime Reasoning-Action Alignment Verification
LaViRA: Language-Vision-Robot Actions Translation for Zero-Shot Vision Language Navigation in Continuous Environments