Advances in Vision-Language-Action Models for Robotic Manipulation

The field of robotic manipulation is advancing rapidly with the development of vision-language-action (VLA) models, which couple visual perception, language understanding, and action generation so that robots can perform complex tasks. Recent research has concentrated on improving the generalization of these models, allowing them to adapt to new tasks and environments with minimal training data.

Notable directions include using large language models to generate structured scene descriptors that strengthen visual understanding and task performance, enforcing action consistency constraints that align visual perception with the corresponding actions, and modeling long-horizon task execution with multi-turn visual dialogue frameworks. Researchers have also investigated foundation models, such as behavior foundation models, that learn reusable primitive skills and behavioral priors, enabling zero-shot or rapid adaptation to downstream tasks. Together, these developments are enabling robots to learn and adapt in increasingly complex environments.

Some noteworthy papers in this area include UniVLA, a unified, natively multimodal VLA model that autoregressively models vision, language, and action signals as discrete token sequences; GoalLadder, which leverages vision-language models to train reinforcement-learning agents from a single language instruction in visual environments; and ControlVLA, a framework that bridges pre-trained VLA models with object-centric representations for efficient fine-tuning.
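To make the discrete-token formulation above concrete, here is a minimal sketch of an autoregressive decoder that treats vision, language, and action signals as a single token sequence. Everything in it is an illustrative assumption rather than the UniVLA implementation: the class name TinyVLADecoder, the shared 4096-token vocabulary, the 256-bin action discretization, and the toy training step are placeholders chosen only to show the idea.

import torch
import torch.nn as nn

class TinyVLADecoder(nn.Module):
    """Autoregressive decoder over a shared vocabulary of vision, language,
    and action tokens (illustrative sketch; not the UniVLA architecture)."""

    def __init__(self, vocab_size=4096, d_model=256, n_heads=8, n_layers=4, max_len=512):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):  # tokens: (batch, seq_len) of discrete ids
        seq_len = tokens.shape[1]
        x = self.tok(tokens) + self.pos(torch.arange(seq_len, device=tokens.device))
        mask = nn.Transformer.generate_square_subsequent_mask(seq_len).to(tokens.device)
        h = self.blocks(x, mask=mask)  # causal self-attention over the mixed sequence
        return self.head(h)            # next-token logits for every position

def discretize_actions(actions, n_bins=256, low=-1.0, high=1.0):
    """Map continuous action dimensions to token ids by uniform binning
    (a common choice in autoregressive policies; bin count and range are assumptions)."""
    actions = actions.clamp(low, high)
    return ((actions - low) / (high - low) * (n_bins - 1)).long()

# Training-step sketch: a sequence is [image tokens | instruction tokens | action tokens],
# and the model is trained with ordinary next-token prediction.
action_tokens = discretize_actions(torch.randn(2, 7).tanh())  # e.g. a 7-DoF command -> 7 ids
# (in practice these would be offset into a reserved region of the shared vocabulary)
prefix = torch.randint(0, 4096, (2, 57))                      # placeholder image + instruction ids
seq = torch.cat([prefix, action_tokens], dim=1)
model = TinyVLADecoder()
logits = model(seq)
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, logits.shape[-1]), seq[:, 1:].reshape(-1)
)

The point the sketch highlights is that once every modality is discretized, a single next-token objective covers perception, instruction following, and control, which is what allows such models to be trained natively across all three signal types.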
Sources
CLIP-MG: Guiding Semantic Attention with Skeletal Pose Features and RGB Data for Micro-Gesture Recognition on the iMiGUE Dataset
T-Rex: Task-Adaptive Spatial Representation Extraction for Robotic Manipulation with Vision-Language Models
Learning Instruction-Following Policies through Open-Ended Instruction Relabeling with Large Language Models
CARMA: Context-Aware Situational Grounding of Human-Robot Group Interactions by Combining Vision-Language Models with Object and Action Recognition
HRIBench: Benchmarking Vision-Language Models for Real-Time Human Perception in Human-Robot Interaction
How do Foundation Models Compare to Skeleton-Based Approaches for Gesture Recognition in Human-Robot Interaction?
Parallels Between VLA Model Post-Training and Human Motor Learning: Progress, Challenges, and Trends