Vision-Language-Action Models: Advancements in Reinforcement Learning and Embodied Intelligence

The field of Vision-Language-Action (VLA) models is advancing rapidly, with a focus on improving reinforcement learning and embodied intelligence. Recent developments center on enhancing VLA models through online interaction, post-deployment learning, and decision understanding. Researchers are also exploring new paradigms, such as shifting from path imitation to decision understanding, to build agents that can navigate and reason about their environments. Noteworthy papers in this area include:

Reinforcement Fine-Tuning of Flow-Matching Policies for Vision-Language-Action Models proposes an algorithm for stable and scalable online reinforcement fine-tuning of flow-matching VLA policies.

Dejavu: Post-Deployment Learning for Embodied Agents via Experience Feedback introduces a general post-deployment learning framework through which embodied agents acquire new knowledge from experience feedback and improve task performance.

CompassNav: Steering From Path Imitation To Decision Understanding In Navigation argues for shifting navigation from path imitation to decision understanding, introducing a new dataset and a gap-aware hybrid reward function to develop an internal 'compass' for navigation (a toy reward sketch follows after this list).

RoVer: Robot Reward Model as Test-Time Verifier for Vision-Language-Action Model presents a test-time scaling framework in which a robot process reward model verifies and selects candidate actions from existing VLA models without modifying their architectures or weights (a best-of-N selection sketch also follows below).

FOSSIL: Harnessing Feedback on Suboptimal Samples for Data-Efficient Generalisation with Imitation Learning for Embodied Vision-and-Language Tasks explores how agents trained with imitation learning can learn robust representations from both optimal and suboptimal demonstrations.

Reflection-Based Task Adaptation for Self-Improving VLA introduces a framework for rapid, autonomous task adaptation without human intervention.

EvoTest: Evolutionary Test-Time Learning for Self-Improving Agentic Systems presents an evolutionary test-time learning framework that improves an agent without any fine-tuning or gradients.

Unifying Environment Perception and Route Choice Modeling for Trajectory Representation Learning proposes a framework that unifies environment perception and route choice modeling for effective trajectory representation learning.
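
The gap-aware hybrid reward idea in CompassNav can be pictured as combining a dense shaping term with a sparse success bonus. The sketch below does not reproduce the paper's formulation; it assumes the 'gap' is the remaining distance to the goal and rewards each step in proportion to how much it closes that gap, plus a terminal bonus on success. The function name and parameters are hypothetical.

```python
# Illustrative sketch only: a hybrid reward assumed to mix dense gap-closing
# shaping with a sparse success bonus. Not the CompassNav implementation.

def hybrid_reward(prev_gap: float, curr_gap: float, reached_goal: bool,
                  success_bonus: float = 10.0, shaping_scale: float = 1.0) -> float:
    """Dense shaping for reducing the goal gap plus a sparse terminal bonus."""
    shaping = shaping_scale * (prev_gap - curr_gap)   # positive when the agent gets closer
    terminal = success_bonus if reached_goal else 0.0
    return shaping + terminal

# Example: the agent moves from 5.0 m to 4.5 m from the goal without finishing the episode.
print(hybrid_reward(prev_gap=5.0, curr_gap=4.5, reached_goal=False))  # 0.5
```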
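
The test-time verification behind RoVer can similarly be illustrated as a best-of-N selection loop around a frozen policy. The sketch below is an assumption-laden illustration rather than the paper's implementation: `SampleableVLAPolicy`, `ProcessRewardModel`, and `best_of_n` are hypothetical placeholders standing in for a VLA model that can sample candidate action chunks and a learned process reward model that scores them, with the policy's architecture and weights left untouched.

```python
# Minimal sketch of test-time verification for a frozen VLA policy.
# All class and function names are hypothetical placeholders.
import numpy as np

class SampleableVLAPolicy:
    """Stand-in for a frozen VLA model that can sample action chunks."""
    def __init__(self, action_dim=7, horizon=8, seed=0):
        self.action_dim, self.horizon = action_dim, horizon
        self.rng = np.random.default_rng(seed)

    def sample_actions(self, observation, instruction):
        # A real policy would condition on the image and language instruction.
        return self.rng.normal(size=(self.horizon, self.action_dim))

class ProcessRewardModel:
    """Stand-in for a robot process reward model scoring an action chunk."""
    def score(self, observation, instruction, actions):
        # A real verifier would predict task progress; here we use a dummy score.
        return -float(np.abs(actions).mean())

def best_of_n(policy, verifier, observation, instruction, n=8):
    """Sample n candidate chunks from the frozen policy and keep the best-scoring one."""
    candidates = [policy.sample_actions(observation, instruction) for _ in range(n)]
    scores = [verifier.score(observation, instruction, a) for a in candidates]
    return candidates[int(np.argmax(scores))]

if __name__ == "__main__":
    policy, verifier = SampleableVLAPolicy(), ProcessRewardModel()
    chunk = best_of_n(policy, verifier, observation=None, instruction="pick up the cup")
    print("selected action chunk shape:", chunk.shape)
```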

Sources

Reinforcement Fine-Tuning of Flow-Matching Policies for Vision-Language-Action Models

Dejavu: Post-Deployment Learning for Embodied Agents via Experience Feedback

CompassNav: Steering From Path Imitation To Decision Understanding In Navigation

RoVer: Robot Reward Model as Test-Time Verifier for Vision-Language-Action Model

FOSSIL: Harnessing Feedback on Suboptimal Samples for Data-Efficient Generalisation with Imitation Learning for Embodied Vision-and-Language Tasks

Reflection-Based Task Adaptation for Self-Improving VLA

EvoTest: Evolutionary Test-Time Learning for Self-Improving Agentic Systems

Unifying Environment Perception and Route Choice Modeling for Trajectory Representation Learning