The field of robotic manipulation is advancing rapidly with the development of Vision-Language-Action (VLA) models, which map multimodal inputs such as visual observations and language instructions to robot actions and have demonstrated strong performance across a range of manipulation tasks. Recent research has focused on improving the robustness and scalability of VLA models, with particular emphasis on long-horizon, multi-step tasks. Notable developments include the integration of temporal context, perceptual-cognitive memory, and phase-aware input masking, which improve the ability of VLA models to handle complex tasks. The use of self-supervised pretext tasks and large-scale datasets has further improved the precision and generalization of these models. Overall, the field is moving toward robust, general-purpose VLA models that can execute complex tasks under real-world sensory constraints. Noteworthy papers in this regard include:
- MemoryVLA, which proposes a Cognition-Memory-Action framework for long-horizon robotic manipulation, achieving a 71.9% success rate on SimplerEnv-Bridge tasks (a simplified memory-retrieval sketch appears after this list).
- Long-VLA, which introduces a phase-aware input masking strategy for long-horizon robotic tasks, significantly outperforming prior state-of-the-art methods on the L-CALVIN benchmark (a masking sketch also follows the list).
- LaVA-Man, which learns visual-action representations through a self-supervised pretext task, demonstrating improved precision in manipulation tasks on the Omni-Object Pick-and-Place dataset.
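To make the memory idea concrete, the following is a minimal sketch of a perceptual memory bank that stores past observation embeddings and retrieves the most similar ones to condition action prediction. It is not MemoryVLA's actual architecture; the class name, dimensions, and mean-pooling fusion are illustrative assumptions.

```python
# Hypothetical sketch (not the MemoryVLA implementation): a minimal
# perceptual memory bank that stores past observation embeddings and
# retrieves the most similar ones to condition the next action prediction.
import numpy as np

class PerceptualMemory:
    def __init__(self, feature_dim: int, capacity: int = 512):
        self.capacity = capacity
        self.features = np.empty((0, feature_dim), dtype=np.float32)

    def write(self, feature: np.ndarray) -> None:
        """Append the current observation embedding, evicting the oldest entry when full."""
        self.features = np.vstack([self.features, feature[None, :]])[-self.capacity:]

    def read(self, query: np.ndarray, top_k: int = 4) -> np.ndarray:
        """Return the top-k stored embeddings by cosine similarity to the query."""
        if len(self.features) == 0:
            return np.zeros((top_k, query.shape[0]), dtype=np.float32)
        sims = self.features @ query / (
            np.linalg.norm(self.features, axis=1) * np.linalg.norm(query) + 1e-8
        )
        idx = np.argsort(sims)[::-1][:top_k]
        return self.features[idx]

# Usage: fuse retrieved memories with the current embedding before the
# action head (here, a stand-in mean-pooling "fusion").
memory = PerceptualMemory(feature_dim=256)
for step in range(10):
    obs_feat = np.random.randn(256).astype(np.float32)  # placeholder encoder output
    retrieved = memory.read(obs_feat)
    fused = np.mean(np.vstack([obs_feat[None, :], retrieved]), axis=0)
    memory.write(obs_feat)
    # "fused" would be passed to the action decoder in a real VLA policy
```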
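Similarly, the sketch below illustrates phase-aware input masking in its simplest form: each phase of a long-horizon task keeps only the input streams assumed relevant and zeroes out the rest before they reach the policy. The phase names, stream names, and binary masks are hypothetical and do not reflect Long-VLA's actual configuration; in practice such masks would be derived from the task decomposition rather than hard-coded.

```python
# Hypothetical sketch (not the Long-VLA implementation): phase-aware input
# masking for a long-horizon task split into "move" and "interact" phases.
# Each phase keeps only the input streams assumed to be relevant and zeroes
# out the rest before they reach the policy.
import numpy as np

# Assumed input streams; names, shapes, and mask values are illustrative only.
PHASE_MASKS = {
    "move":     {"static_cam": 1.0, "gripper_cam": 0.0, "proprio": 1.0},
    "interact": {"static_cam": 0.0, "gripper_cam": 1.0, "proprio": 1.0},
}

def mask_inputs(inputs: dict, phase: str) -> dict:
    """Zero out input streams that the current phase is assumed not to need."""
    mask = PHASE_MASKS[phase]
    return {name: feat * mask[name] for name, feat in inputs.items()}

# Usage with placeholder features standing in for encoder outputs.
inputs = {
    "static_cam":  np.random.randn(256).astype(np.float32),
    "gripper_cam": np.random.randn(256).astype(np.float32),
    "proprio":     np.random.randn(16).astype(np.float32),
}
masked_move = mask_inputs(inputs, phase="move")          # gripper_cam suppressed
masked_interact = mask_inputs(inputs, phase="interact")  # static_cam suppressed
```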