Advances in Vision-Language-Action Models for Embodied Intelligence

The field of embodied intelligence is advancing rapidly, with a focus on developing more efficient and more generalizable vision-language-action (VLA) models. Recent work has explored synergistic quantization-aware pruning frameworks, task-adaptive 3D grounding mechanisms, and embodiment-aware reasoning frameworks, and reports state-of-the-art results on tasks such as visual navigation, robotic manipulation, and human-robot interaction. Noteworthy papers in this area include SQAP-VLA, which introduced a structured framework for simultaneous quantization and token pruning, and OmniEVA, which proposed a versatile planner for advanced embodied reasoning and task planning. Other notable works, such as VLA-Adapter and SimpleVLA-RL, demonstrated new paradigms for bridging vision-language representations to action and for improving long-horizon, step-by-step action planning in VLA models.
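To make the quantization-and-pruning idea concrete, below is a minimal, generic sketch in PyTorch: visual tokens are ranked by an importance score (for example, attention received from the language stream) and only the top-k are kept, while a weight matrix is quantized to int8. This is an illustration of the general technique only, not the SQAP-VLA algorithm; the function names, the scoring rule, and the per-tensor quantization scheme are assumptions chosen for clarity.

```python
# Illustrative sketch only: generic visual-token pruning plus int8 weight
# quantization. NOT the SQAP-VLA method; names and scoring rule are assumed.
import torch


def prune_visual_tokens(tokens: torch.Tensor,
                        attn_scores: torch.Tensor,
                        keep_ratio: float = 0.5) -> torch.Tensor:
    """Keep the most salient visual tokens.

    tokens:      (batch, num_tokens, dim) visual token embeddings
    attn_scores: (batch, num_tokens) importance per token, e.g. mean
                 attention it receives from the language/action stream
    """
    batch, num_tokens, dim = tokens.shape
    k = max(1, int(num_tokens * keep_ratio))
    topk = attn_scores.topk(k, dim=1).indices            # (batch, k)
    idx = topk.unsqueeze(-1).expand(-1, -1, dim)         # (batch, k, dim)
    return tokens.gather(1, idx)                         # keep top-k tokens


def quantize_weights_int8(w: torch.Tensor):
    """Symmetric per-tensor int8 quantization of a weight matrix."""
    scale = w.abs().max() / 127.0
    q = torch.clamp((w / scale).round(), -128, 127).to(torch.int8)
    return q, scale  # dequantize later with q.float() * scale


if __name__ == "__main__":
    tokens = torch.randn(2, 256, 768)      # dummy visual tokens
    scores = torch.rand(2, 256)            # dummy importance scores
    pruned = prune_visual_tokens(tokens, scores, keep_ratio=0.25)
    q, s = quantize_weights_int8(torch.randn(768, 768))
    print(pruned.shape, q.dtype, s.item())
```

The "synergistic" framing in SQAP-VLA suggests that the pruning criterion and the quantizer are designed jointly rather than applied independently as in this sketch, which treats them as two separate post-hoc steps.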

Sources

SQAP-VLA: A Synergistic Quantization-Aware Pruning Framework for High-Performance Vision-Language-Action Models

OmniEVA: Embodied Versatile Planner via Task-Adaptive 3D-Grounded and Embodiment-aware Reasoning

AGILOped: Agile Open-Source Humanoid Robot for Research

VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model

ObjectReact: Learning Object-Relative Control for Visual Navigation

SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning

Gaussian path model library for intuitive robot motion programming by demonstration

DiffAero: A GPU-Accelerated Differentiable Simulation Framework for Efficient Quadrotor Policy Learning

GC-VLN: Instruction as Graph Constraints for Training-free Vision-and-Language Navigation

Pre-trained Visual Representations Generalize Where it Matters in Model-Based Reinforcement Learning

The Better You Learn, The Smarter You Prune: Towards Efficient Vision-language-action Models via Differentiable Token Pruning

ActiveVLN: Towards Active Exploration via Multi-Turn RL in Vision-and-Language Navigation

DialNav: Multi-turn Dialog Navigation with a Remote Guide

HERO: Rethinking Visual Token Early Dropping in High-Resolution Large Vision-Language Models

SimCoachCorpus: A naturalistic dataset with language and trajectories for embodied teaching

Toward Embodiment Equivariant Vision-Language-Action Policy

RealMirror: A Comprehensive, Open-Source Vision-Language-Action Platform for Embodied AI

Designing Latent Safety Filters using Pre-Trained Vision Models

CollabVLA: Self-Reflective Vision-Language-Action Model Dreaming Together with Human

Robot Control Stack: A Lean Ecosystem for Robot Learning at Scale

Ask-to-Clarify: Resolving Instruction Ambiguity through Multi-turn Dialogue

RynnVLA-001: Using Human Demonstrations to Improve Robot Manipulation
