Advances in Vision-Language-Action Models for Robotics

The field of robotics is moving towards more generalizable and robust Vision-Language-Action (VLA) models. Recent research focuses on improving VLA performance across tasks such as visual navigation, manipulation, and embodied visual tracking. One key direction is multimodal learning, in which models are trained jointly on vision, language, and action data; this approach has shown promising gains in generalization. Another important thread is more efficient and scalable training, including reinforcement learning and meta-learning, which have been shown to improve VLA performance in complex tasks and environments. Noteworthy papers in this area include MM-Nav, which proposes a multi-view VLA model for robust visual navigation via multi-expert learning, and TrackVLA++, which enhances embodied visual tracking with spatial reasoning and temporal memory mechanisms. Overall, the field is progressing towards more capable and generalizable VLA models that can be deployed in real-world robotic applications.
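
To make the VLA framing concrete, the sketch below shows the basic input-output contract such models share: an image observation and a language instruction are encoded, fused, and mapped to an action. This is a minimal illustrative example assuming PyTorch; the ToyVLAPolicy class, module choices, and dimensions are placeholders and do not reflect the architecture of any paper listed under Sources.

```python
# Minimal sketch of a generic vision-language-action (VLA) policy.
# Assumption: all names and sizes are illustrative, not from any cited paper.
import torch
import torch.nn as nn


class ToyVLAPolicy(nn.Module):
    """Fuses an image observation and a tokenized instruction into an action."""

    def __init__(self, vocab_size=1000, embed_dim=256, action_dim=7):
        super().__init__()
        # Vision encoder: a small CNN standing in for a pretrained vision backbone.
        self.vision_encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        # Language encoder: token embedding + mean pooling standing in for a language model.
        self.text_embedding = nn.Embedding(vocab_size, embed_dim)
        # Fusion and action head: concatenate the two modalities and regress an action.
        self.action_head = nn.Sequential(
            nn.Linear(2 * embed_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, action_dim),
        )

    def forward(self, image, instruction_tokens):
        vision_feat = self.vision_encoder(image)                         # (B, embed_dim)
        text_feat = self.text_embedding(instruction_tokens).mean(dim=1)  # (B, embed_dim)
        fused = torch.cat([vision_feat, text_feat], dim=-1)
        return self.action_head(fused)                                   # (B, action_dim)


if __name__ == "__main__":
    policy = ToyVLAPolicy()
    image = torch.randn(2, 3, 128, 128)        # batch of RGB observations
    tokens = torch.randint(0, 1000, (2, 16))   # batch of tokenized instructions
    print(policy(image, tokens).shape)         # torch.Size([2, 7])
```

Real VLA systems typically swap the toy encoders for pretrained vision and language backbones and often decode discretized action tokens autoregressively rather than regressing a single continuous action vector.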

Sources

From Pixels to Factors: Learning Independently Controllable State Variables for Reinforcement Learning

MM-Nav: Multi-View VLA Model for Robust Visual Navigation via Multi-Expert Learning

Gemini Robotics 1.5: Pushing the Frontier of Generalist Robots with Advanced Embodied Reasoning, Thinking, and Motion Transfer

Fine-Tuning on Noisy Instructions: Effects on Generalization and Performance

LIBERO-PRO: Towards Robust and Fair Evaluation of Vision-Language-Action Models Beyond Memorization

Seeing the Bigger Picture: 3D Latent Mapping for Mobile Manipulation Policy Learning

Bridge Thinking and Acting: Unleashing Physical Potential of VLM with Generalizable Action Expert

SITCOM: Scaling Inference-Time COMpute for VLAs

NoTVLA: Narrowing of Dense Action Trajectories for Generalizable Robot Manipulation

Zenbo Patrol: A Social Assistive Robot Based on Multimodal Deep Learning for Real-time Illegal Parking Recognition and Notification

ContextVLA: Vision-Language-Action Model with Amortized Multi-Frame Context

Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time

HyperVLA: Efficient Inference in Vision-Language-Action Models via Hypernetworks

StaMo: Unsupervised Learning of Generalizable Robot Motion from Compact State Representation

Focused Skill Discovery: Learning to Control Specific State Variables while Minimizing Side Effects

Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment

RLinf-VLA: A Unified and Efficient Framework for VLA+RL Training

Bring the Apple, Not the Sofa: Impact of Irrelevant Context in Embodied AI Commands on VLA Models

Vision-Language-Action Models for Robotics: A Review Towards Real-World Applications

Generative World Modelling for Humanoids: 1X World Model Challenge Technical Report

TrackVLA++: Unleashing Reasoning and Memory Capabilities in VLA Models for Embodied Visual Tracking

TIGeR: Tool-Integrated Geometric Reasoning in Vision-Language Models for Robotics
