Advances in Vision-Language-Action Models for Robotic Manipulation

The field of robotic manipulation is advancing rapidly with the development of vision-language-action (VLA) models, which couple visual perception, language understanding, and action generation so that robots can perform complex tasks. Recent research has concentrated on improving the generalization of these models, allowing them to adapt to new tasks and environments with minimal training data.

Notable directions include using large language models to generate structured scene descriptors that strengthen visual understanding and task performance, enforcing action consistency constraints that align visual perception with the corresponding actions, and modeling long-horizon task execution with multi-turn visual dialogue frameworks. Researchers have also investigated foundation models, such as behavior foundation models, that learn reusable primitive skills and behavioral priors, enabling zero-shot or rapid adaptation to downstream tasks. Together, these developments are enabling robots to learn and adapt in increasingly complex environments.

Some noteworthy papers in this area include UniVLA, a unified, natively multimodal VLA model that autoregressively models vision, language, and action signals as discrete token sequences; GoalLadder, which leverages vision-language models to train reinforcement-learning agents from a single language instruction in visual environments; and ControlVLA, a framework that bridges pre-trained VLA models with object-centric representations for efficient fine-tuning.
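To make the discrete-token formulation above concrete, here is a minimal sketch of an autoregressive decoder that treats vision, language, and action signals as a single token sequence. Everything in it is an illustrative assumption rather than the UniVLA implementation: the class name TinyVLADecoder, the shared 4096-token vocabulary, the 256-bin action discretization, and the toy training step are placeholders chosen only to show the idea.

import torch
import torch.nn as nn

class TinyVLADecoder(nn.Module):
    """Autoregressive decoder over a shared vocabulary of vision, language,
    and action tokens (illustrative sketch; not the UniVLA architecture)."""

    def __init__(self, vocab_size=4096, d_model=256, n_heads=8, n_layers=4, max_len=512):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):  # tokens: (batch, seq_len) of discrete ids
        seq_len = tokens.shape[1]
        x = self.tok(tokens) + self.pos(torch.arange(seq_len, device=tokens.device))
        mask = nn.Transformer.generate_square_subsequent_mask(seq_len).to(tokens.device)
        h = self.blocks(x, mask=mask)  # causal self-attention over the mixed sequence
        return self.head(h)            # next-token logits for every position

def discretize_actions(actions, n_bins=256, low=-1.0, high=1.0):
    """Map continuous action dimensions to token ids by uniform binning
    (a common choice in autoregressive policies; bin count and range are assumptions)."""
    actions = actions.clamp(low, high)
    return ((actions - low) / (high - low) * (n_bins - 1)).long()

# Training-step sketch: a sequence is [image tokens | instruction tokens | action tokens],
# and the model is trained with ordinary next-token prediction.
action_tokens = discretize_actions(torch.randn(2, 7).tanh())  # e.g. a 7-DoF command -> 7 ids
# (in practice these would be offset into a reserved region of the shared vocabulary)
prefix = torch.randint(0, 4096, (2, 57))                      # placeholder image + instruction ids
seq = torch.cat([prefix, action_tokens], dim=1)
model = TinyVLADecoder()
logits = model(seq)
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, logits.shape[-1]), seq[:, 1:].reshape(-1)
)

The point the sketch highlights is that once every modality is discretized, a single next-token objective covers perception, instruction following, and control, which is what allows such models to be trained natively across all three signal types.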
Sources
CLIP-MG: Guiding Semantic Attention with Skeletal Pose Features and RGB Data for Micro-Gesture Recognition on the iMiGUE Dataset
T-Rex: Task-Adaptive Spatial Representation Extraction for Robotic Manipulation with Vision-Language Models
Learning Instruction-Following Policies through Open-Ended Instruction Relabeling with Large Language Models
CARMA: Context-Aware Situational Grounding of Human-Robot Group Interactions by Combining Vision-Language Models with Object and Action Recognition
HRIBench: Benchmarking Vision-Language Models for Real-Time Human Perception in Human-Robot Interaction
How do Foundation Models Compare to Skeleton-Based Approaches for Gesture Recognition in Human-Robot Interaction?
Parallels Between VLA Model Post-Training and Human Motor Learning: Progress, Challenges, and Trends