Advances in Vision-Language-Action Models for Embodied Intelligence

The field of embodied intelligence is advancing rapidly, with much of the effort directed at making vision-language-action (VLA) models more efficient and more broadly generalizable. Recent research explores several complementary directions, including synergistic quantization-aware pruning frameworks, task-adaptive 3D grounding mechanisms, and embodiment-aware reasoning frameworks. These advances have enabled VLA models to reach state-of-the-art performance on tasks such as visual navigation, robotic manipulation, and human-robot interaction. Noteworthy papers in this area include SQAP-VLA, which introduces a structured framework for simultaneous quantization and token pruning, and OmniEVA, which proposes a versatile planner for advanced embodied reasoning and task planning. Other notable works, such as VLA-Adapter and SimpleVLA-RL, demonstrate new paradigms for bridging vision-language representations to action and for improving long-horizon, step-by-step action planning in VLA models.
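To make the quantization-plus-token-pruning idea concrete, the sketch below pairs post-training int8 quantization with simple norm-based pruning of visual tokens on a toy action head. It is only a minimal illustration of the general recipe, not the SQAP-VLA method: the toy model, the norm-based saliency score, and the keep ratio are all assumptions introduced here for the example.

```python
# Minimal sketch: post-training quantization combined with visual-token pruning.
# NOT the SQAP-VLA algorithm; model, scoring rule, and keep_ratio are placeholders.
import torch
import torch.nn as nn

class ToyVLAHead(nn.Module):
    """Tiny stand-in for a VLA action head: pooled tokens -> continuous action."""
    def __init__(self, dim: int = 64, action_dim: int = 7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, 128), nn.ReLU(),
            nn.Linear(128, action_dim),
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:  # tokens: (B, N, dim)
        return self.mlp(tokens.mean(dim=1))                   # pool tokens, predict action

def prune_tokens(tokens: torch.Tensor, keep_ratio: float = 0.25) -> torch.Tensor:
    """Drop low-saliency visual tokens; saliency is approximated here by L2 norm."""
    scores = tokens.norm(dim=-1)                              # (B, N) per-token scores
    k = max(1, int(keep_ratio * tokens.shape[1]))
    idx = scores.topk(k, dim=1).indices                       # (B, k) kept-token indices
    idx = idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])
    return torch.gather(tokens, 1, idx)                       # (B, k, dim)

model = ToyVLAHead().eval()
# Post-training dynamic quantization: int8 weights for every nn.Linear layer.
quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

visual = torch.randn(2, 196, 64)       # e.g. 14x14 image patch tokens
language = torch.randn(2, 16, 64)      # instruction tokens (left intact)
fused = torch.cat([prune_tokens(visual), language], dim=1)
print(quantized(fused).shape)          # torch.Size([2, 7])
```

Running the quantized head on the pruned token sequence reduces both weight precision and sequence length at inference time, which is the efficiency lever such frameworks target; a real system would also calibrate accuracy against the unpruned, full-precision policy.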
Sources
SQAP-VLA: A Synergistic Quantization-Aware Pruning Framework for High-Performance Vision-Language-Action Models
DiffAero: A GPU-Accelerated Differentiable Simulation Framework for Efficient Quadrotor Policy Learning
Pre-trained Visual Representations Generalize Where it Matters in Model-Based Reinforcement Learning