Advances in Robot Manipulation and Vision-Language-Action Models

The field of robot manipulation is moving toward more generalizable and robust policies, driven by advances in vision-language-action (VLA) models and large-scale robot demonstrations. Recent work focuses on improving the scalability and efficiency of VLA models, enabling them to learn from unlabeled data and to adapt to new tasks and environments. In particular, latent action representations, diffusion-based reinforcement learning, and self-supervised learning have shown promising results for improving the performance and generalization of VLA models. There is also growing interest in methods that learn from human demonstrations, videos, and other forms of weak supervision, reducing the need for expensive and time-consuming robot data collection. Together, these advances stand to broaden robot capabilities across applications such as general-purpose manipulation, surgical robotics, and autonomous systems.
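
The latent-action and self-supervised directions mentioned above share a common recipe: infer a latent "action" that explains the transition between consecutive video frames, and train a world model to predict the next frame from it, so that policies can be pretrained on unlabeled video without action labels. The sketch below is a minimal, generic illustration of that recipe, assuming flattened frame features and a continuous latent; the module names, dimensions, and loss are assumptions for illustration and do not reproduce the architectures of the cited papers (which typically use visual encoders and vector-quantized latents).

```python
# Minimal sketch of latent-action pretraining via world modeling.
# Assumptions: frames are flattened feature vectors; real systems use
# visual encoders and discrete latents. Not any cited paper's implementation.
import torch
import torch.nn as nn

FRAME_DIM = 512   # assumed dimensionality of a per-frame visual feature
LATENT_DIM = 16   # assumed size of the latent "action" code

class LatentActionEncoder(nn.Module):
    """Infers a latent action from a pair of consecutive frames."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * FRAME_DIM, 256), nn.ReLU(),
            nn.Linear(256, LATENT_DIM),
        )

    def forward(self, frame_t, frame_tp1):
        return self.net(torch.cat([frame_t, frame_tp1], dim=-1))

class WorldModel(nn.Module):
    """Predicts the next frame from the current frame and the latent action."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(FRAME_DIM + LATENT_DIM, 256), nn.ReLU(),
            nn.Linear(256, FRAME_DIM),
        )

    def forward(self, frame_t, latent_action):
        return self.net(torch.cat([frame_t, latent_action], dim=-1))

encoder, world_model = LatentActionEncoder(), WorldModel()
opt = torch.optim.Adam(
    list(encoder.parameters()) + list(world_model.parameters()), lr=1e-4
)

# One self-supervised step on an unlabeled video batch: no action labels needed.
frame_t = torch.randn(32, FRAME_DIM)    # stand-in for encoded video frames
frame_tp1 = torch.randn(32, FRAME_DIM)
z = encoder(frame_t, frame_tp1)         # latent action explains the transition
pred_tp1 = world_model(frame_t, z)      # world model rolls the frame forward
loss = nn.functional.mse_loss(pred_tp1, frame_tp1)
opt.zero_grad(); loss.backward(); opt.step()
```

In approaches of this kind, the inferred latents typically serve as pseudo-action labels for downstream imitation learning or as an initialization for a VLA policy head, which is what lets unlabeled video substitute for part of the expensive robot demonstration collection.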

Noteworthy papers include:

PEEK fine-tunes vision-language models to predict a unified, point-based intermediate representation that enables zero-shot generalization of robot manipulation policies (a conceptual sketch of this kind of interface follows this list).

Latent Action Pretraining Through World Modeling proposes a model-agnostic framework for pretraining imitation learning models in a self-supervised way.

Eva-VLA evaluates the robustness of VLA models under real-world physical variations and exposes critical gaps between success in controlled laboratory settings and readiness for unpredictable deployment.

Beyond Human Demonstrations proposes a diffusion-based reinforcement learning approach that generates high-quality, low-variance trajectories for VLA training.

LLM Trainer presents a fully automated pipeline for generating robot datasets via demonstration augmentation with large language models.

Parse-Augment-Distill learns generalizable bimanual visuomotor policies from single human videos through a unified framework of parsing, augmentation, and distillation.
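
The point-based intermediate representation highlighted for PEEK suggests a simple interface between a high-level vision-language model and a low-level policy. The sketch below is a hypothetical illustration of such an interface, assuming the VLM returns a handful of 2D image points (e.g., a rough end-effector path plus a target-object point) that are normalized and concatenated with proprioception before reaching the policy; the function names, shapes, and dummy outputs are assumptions, not PEEK's actual implementation.

```python
# Hypothetical sketch of a point-based VLM-to-policy interface.
# Assumed names, shapes, and dummy outputs; not any cited paper's code.
import numpy as np

def vlm_predict_points(image: np.ndarray, instruction: str) -> np.ndarray:
    """Stand-in for a fine-tuned VLM mapping (image, instruction) to a few
    2D image points, e.g. a coarse path and a task-relevant region.
    Returns fixed dummy points purely for illustration."""
    h, w = image.shape[:2]
    return np.array([[0.30 * w, 0.40 * h],   # assumed "move here" waypoint
                     [0.55 * w, 0.50 * h],   # assumed intermediate waypoint
                     [0.70 * w, 0.65 * h]])  # assumed target-object point

def build_policy_input(image: np.ndarray, points: np.ndarray,
                       proprio: np.ndarray) -> np.ndarray:
    """Flattens a minimal observation for the low-level policy: normalized
    point coordinates plus proprioception. A real system might instead
    overlay or mask the image with the predicted points."""
    h, w = image.shape[:2]
    norm_points = points / np.array([w, h])   # scale pixel coords to [0, 1]
    return np.concatenate([norm_points.ravel(), proprio])

# Example usage with placeholder data.
image = np.zeros((224, 224, 3), dtype=np.uint8)   # placeholder camera frame
proprio = np.zeros(7)                             # placeholder joint state
points = vlm_predict_points(image, "put the mug on the shelf")
policy_obs = build_policy_input(image, points, proprio)
print(policy_obs.shape)                           # (13,) = 3 points * 2 + 7
```

One appeal of such minimal intermediate representations is that, in principle, the high-level model and the low-level policy can be trained or swapped independently, since only a small, interpretable set of points passes between them.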

Sources

PEEK: Guiding and Minimal Image Representations for Zero-Shot Generalization of Robot Manipulation Policies

Latent Action Pretraining Through World Modeling

Surgical Video Understanding with Label Interpolation

Eva-VLA: Evaluating Vision-Language-Action Models' Robustness Under Real-World Physical Variations

Beyond Human Demonstrations: Diffusion-Based Reinforcement Learning to Generate Data for VLA Training

Generalist Robot Manipulation beyond Action Labeled Data

LLM Trainer: Automated Robotic Data Generating via Demonstration Augmentation using LLMs

Parse-Augment-Distill: Learning Generalizable Bimanual Visuomotor Policies from Single Human Video
