Advancements in Robotic Manipulation and Vision-Language-Action Models

The field of robotic manipulation is advancing rapidly through the integration of vision-language-action (VLA) models. Researchers are pursuing several complementary directions to improve the robustness and generalizability of these models across environments. One notable direction is the development of diffusion policies for robotic manipulation, which have shown promising results on contact-rich assembly tasks subject to fabrication uncertainty. Another is the creation of generative models for articulated mechanisms, enabling both simulation and generation of novel morphologies. A third augments VLA models with synergistic declarative memory and active visual attention mechanisms, improving performance on long-horizon mobile manipulation tasks.

Noteworthy papers in this area include:

- Learning Diffusion Policies for Robotic Manipulation of Timber Joinery under Fabrication Uncertainty demonstrates that sensory-motor diffusion policies can generalize to complex assembly tasks (a minimal sampling-loop sketch follows this list).
- ArticFlow introduces a two-stage flow matching framework for generative simulation of articulated mechanisms (the generic flow-matching objective it builds on is sketched below).
- EchoVLA presents a memory-aware VLA model with improved performance on long-horizon mobile manipulation.
- SkillWrapper proposes generative predicate invention for skill abstraction, enabling provably sound and complete planning.
- Object-centric Task Representation and Transfer using Diffused Orientation Fields introduces an approach for transferring tasks across curved objects via diffused orientation fields (an illustrative smoothing sketch appears after this list).
- AVA-VLA reformulates the vision-language-action problem as a Partially Observable Markov Decision Process and proposes a framework built around active visual attention (see the attention-pooling sketch below).
- Rethinking Intermediate Representation for VLM-based Robot Manipulation designs a semantic assembly representation that improves robot manipulation.
- ArtiBench and ArtiBrain introduce a benchmark and a modular framework for generalizable vision-language articulated object manipulation.
- OVAL-Grasp proposes an open-vocabulary approach to task-oriented grasping that combines large language models and vision-language models.
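The timber-joinery work applies diffusion policies, in which actions are produced by iteratively denoising Gaussian noise conditioned on sensory observations. Below is a minimal, illustrative DDPM-style sampling loop, not the paper's implementation: `eps_model` (the trained denoiser), the observation handling, and the noise schedule are all assumptions of this sketch.

```python
import torch

def sample_action(eps_model, obs, action_dim=7, steps=50):
    """DDPM-style reverse diffusion: start from Gaussian noise and
    iteratively denoise into an action, conditioned on the observation."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    a = torch.randn(1, action_dim)  # start from pure noise
    for t in reversed(range(steps)):
        eps = eps_model(obs, a, torch.tensor([t]))       # predicted noise
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (a - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(a) if t > 0 else torch.zeros_like(a)
        a = mean + torch.sqrt(betas[t]) * noise
    return a  # denoised action, e.g. an end-effector pose delta

# Stand-in denoiser so the sketch runs end to end; a real policy would be a
# trained network conditioned on camera and force-torque observations.
eps_model = lambda obs, a, t: torch.zeros_like(a)
action = sample_action(eps_model, obs=None)
```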
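ArticFlow's two-stage framework builds on flow matching. For context, here is a minimal sketch of the generic (rectified) flow-matching training objective, which regresses a velocity field onto straight-line interpolation paths. Here `v_model` is a hypothetical velocity network, not the paper's architecture.

```python
import torch

def flow_matching_loss(v_model, x0, x1):
    """Conditional flow matching: regress the model's velocity field onto the
    constant velocity (x1 - x0) of the path x_t = (1 - t) * x0 + t * x1."""
    t = torch.rand(x0.shape[0], 1)      # random time in [0, 1] per sample
    x_t = (1 - t) * x0 + t * x1         # point on the straight-line path
    target_v = x1 - x0                  # velocity of that path
    return ((v_model(x_t, t) - target_v) ** 2).mean()

# Toy usage with a stand-in velocity network.
v_model = lambda x, t: torch.zeros_like(x)
loss = flow_matching_loss(v_model, torch.randn(8, 16), torch.randn(8, 16))
```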
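Diffused orientation fields spread local orientation cues smoothly across a curved surface so that a task defined on one object can transfer to another. The sketch below conveys the core idea with simple neighborhood averaging of unit vectors over a point graph; it is an illustrative stand-in, not the paper's formulation on meshes, and it assumes every point has at least one neighbor.

```python
import numpy as np

def diffuse_orientations(vectors, neighbors, iters=100, lam=0.5):
    """Heat-style diffusion of unit orientation vectors over a surface graph:
    blend each vector with the mean of its neighbors, then re-normalize so
    the field stays unit length."""
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    for _ in range(iters):
        avg = np.stack([v[n].mean(axis=0) for n in neighbors])
        v = (1 - lam) * v + lam * avg
        v /= np.linalg.norm(v, axis=1, keepdims=True)
    return v

# Toy chain of three points; the middle vector is misaligned and gets smoothed.
vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
nbrs = [[1], [0, 2], [1]]
print(diffuse_orientations(vecs, nbrs, iters=10))
```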
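AVA-VLA's active visual attention steers computation toward task-relevant image regions. As a rough illustration (not the paper's actual mechanism), one can score visual patch tokens against an instruction embedding and pool an attention-weighted summary; all shapes and names here are assumptions of the sketch.

```python
import torch
import torch.nn.functional as F

def active_visual_attention(instr_emb, patch_tokens):
    """Instruction-conditioned attention over visual tokens: score each image
    patch against the instruction embedding and pool a focused feature."""
    d = instr_emb.shape[-1]
    scores = patch_tokens @ instr_emb / d ** 0.5   # (num_patches,)
    weights = F.softmax(scores, dim=0)             # attention over patches
    return weights @ patch_tokens                  # weighted visual summary

instr = torch.randn(256)           # instruction embedding (assumed size)
patches = torch.randn(196, 256)    # ViT-style patch tokens
summary = active_visual_attention(instr, patches)
```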

Sources

Learning Diffusion Policies for Robotic Manipulation of Timber Joinery under Fabrication Uncertainty

ArticFlow: Generative Simulation of Articulated Mechanisms

EchoVLA: Robotic Vision-Language-Action Model with Synergistic Declarative Memory for Mobile Manipulation

SkillWrapper: Generative Predicate Invention for Skill Abstraction

Object-centric Task Representation and Transfer using Diffused Orientation Fields

AVA-VLA: Improving Vision-Language-Action models with Active Visual Attention

Rethinking Intermediate Representation for VLM-based Robot Manipulation

ArtiBench and ArtiBrain: Benchmarking Generalizable Vision-Language Articulated Object Manipulation

OVAL-Grasp: Open-Vocabulary Affordance Localization for Task Oriented Grasping
