Advances in Vision-Language-Action Models for Robotic Manipulation

Robotic manipulation is advancing rapidly through vision-language-action (VLA) models, which couple visual perception and language understanding with action generation to pursue general-purpose manipulation. Recent work targets adaptability, accuracy, and efficiency, with particular attention to out-of-distribution settings and long-horizon tasks. Noteworthy papers include EL3DD, which proposes an extended latent 3D diffusion model for language-conditioned multitask manipulation, and AsyncVLA, which introduces asynchronous flow matching so the policy can self-correct during action generation. Benchmarks such as RoboTidy and FreeAskWorld support evaluation and comparison of VLA models in realistic scenarios. Overall, the field is converging on robust, efficient, and generalizable VLA models suited to real-world deployment.
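To make the action-generation side concrete: several of the models above (e.g., AsyncVLA) build on flow matching, which trains a network to predict a velocity field that transports Gaussian noise into an expert action chunk, conditioned on the observation. The sketch below is a minimal, generic illustration of standard conditional flow matching, not the asynchronous variant or code from any listed paper; the class name `ActionFlowModel`, the helper names, and all dimensions are assumptions for illustration only.

```python
# Minimal conditional flow matching sketch for VLA-style action generation.
# All names and dimensions are hypothetical, not from the papers above.
import torch
import torch.nn as nn

class ActionFlowModel(nn.Module):
    """Predicts a velocity field over an action chunk, conditioned on a
    fused vision-language observation embedding and a flow time t."""
    def __init__(self, obs_dim: int = 512, action_dim: int = 7, horizon: int = 16):
        super().__init__()
        self.horizon, self.action_dim = horizon, action_dim
        self.net = nn.Sequential(
            nn.Linear(obs_dim + horizon * action_dim + 1, 1024),
            nn.GELU(),
            nn.Linear(1024, horizon * action_dim),
        )

    def forward(self, obs, actions, t):
        x = torch.cat([obs, actions.flatten(1), t[:, None]], dim=-1)
        return self.net(x).view(-1, self.horizon, self.action_dim)

def flow_matching_loss(model, obs, target_actions):
    """Regress the constant velocity along the straight path from
    Gaussian noise x0 to the expert action chunk x1."""
    x1 = target_actions
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.shape[0], device=x1.device)
    xt = (1 - t)[:, None, None] * x0 + t[:, None, None] * x1  # linear interpolation
    v_target = x1 - x0                                        # target velocity field
    v_pred = model(obs, xt, t)
    return ((v_pred - v_target) ** 2).mean()

@torch.no_grad()
def sample_actions(model, obs, steps: int = 10):
    """Integrate the learned ODE from noise to actions with Euler steps."""
    x = torch.randn(obs.shape[0], model.horizon, model.action_dim, device=obs.device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((obs.shape[0],), i * dt, device=obs.device)
        x = x + dt * model(obs, x, t)
    return x
```

At inference, a few Euler steps per action chunk make this family of policies fast enough for closed-loop control, which is one reason flow matching has become a common choice for VLA action heads.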

Sources

Experiences from Benchmarking Vision-Language-Action Models for Robotic Manipulation

EL3DD: Extended Latent 3D Diffusion for Language Conditioned Multitask Manipulation

FreeAskWorld: An Interactive and Closed-Loop Simulator for Human-Centric Embodied AI

OpenRoboCare: A Multimodal Multi-Task Expert Demonstration Dataset for Robot Caregiving

AsyncVLA: Asynchronous Flow Matching for Vision-Language-Action Models

RoboTidy: A 3D Gaussian Splatting Household Tidying Benchmark for Embodied Navigation and Action

Towards Deploying VLA without Fine-Tuning: Plug-and-Play Inference-Time VLA Policy Steering via Embodied Evolutionary Diffusion

NORA-1.5: A Vision-Language-Action Model Trained using World Model- and Action-based Preference Rewards

$\pi^{*}_{0.6}$: a VLA That Learns From Experience

SRPO: Self-Referential Policy Optimization for Vision-Language-Action Models

EvoVLA: Self-Evolving Vision-Language-Action Model

MiMo-Embodied: X-Embodied Foundation Model Technical Report

Bridging VLMs and Embodied Intelligence with Deliberate Practice Policy Optimization
