The field of vision-language-action (VLA) models is advancing rapidly, with a focus on improving spatial awareness and spatial reasoning. Recent work shows that incorporating spatial information, either explicitly or implicitly, can significantly improve the performance of VLA models on robotic tasks. Notable advances include the use of depth prediction, spatial grounding, and geometric representations to sharpen action precision and spatial reasoning. These innovations can help robots better understand and interact with their environment, leading to more reliable and efficient task execution.
Noteworthy papers include InternVLA-M1, which introduces a spatially guided vision-language-action framework that achieves state-of-the-art results on instruction-following robot tasks, and DepthVLA, which presents a VLA architecture that adds explicit spatial awareness through a pretrained depth prediction module and outperforms existing approaches in both real-world and simulated environments.
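To make the general design pattern concrete, the sketch below shows one way a frozen depth branch could be fused with vision-language tokens before action decoding. This is a minimal, assumed PyTorch illustration, not the actual InternVLA-M1 or DepthVLA implementation; the module names, dimensions, pooling, and fusion scheme are placeholders chosen for brevity.

```python
# Hypothetical depth-augmented VLA action head (illustrative only): a frozen
# depth-feature branch is fused with pooled vision-language tokens, and a
# small MLP decodes a continuous action. Names and shapes are assumptions.
import torch
import torch.nn as nn


class DepthAugmentedActionHead(nn.Module):
    def __init__(self, token_dim=512, depth_dim=256, action_dim=7):
        super().__init__()
        # Stand-in for a pretrained, frozen monocular depth encoder.
        self.depth_encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(32, depth_dim),
        )
        for p in self.depth_encoder.parameters():
            p.requires_grad = False  # keep the depth branch frozen

        # Fuse pooled vision-language tokens with depth features.
        self.fusion = nn.Linear(token_dim + depth_dim, token_dim)
        # Decode a continuous action (e.g., 6-DoF end-effector delta + gripper).
        self.action_head = nn.Sequential(
            nn.ReLU(),
            nn.Linear(token_dim, action_dim),
        )

    def forward(self, vl_tokens, rgb):
        # vl_tokens: (batch, num_tokens, token_dim) from a VLM backbone
        # rgb: (batch, 3, H, W) camera observation
        pooled = vl_tokens.mean(dim=1)            # simple token pooling
        depth_feat = self.depth_encoder(rgb)      # frozen spatial features
        fused = self.fusion(torch.cat([pooled, depth_feat], dim=-1))
        return self.action_head(fused)            # (batch, action_dim)


if __name__ == "__main__":
    head = DepthAugmentedActionHead()
    vl_tokens = torch.randn(2, 64, 512)   # placeholder VLM output
    rgb = torch.randn(2, 3, 224, 224)     # placeholder camera frame
    print(head(vl_tokens, rgb).shape)     # torch.Size([2, 7])
```

Freezing the depth branch mirrors the common pattern of reusing a pretrained spatial module while training only the fusion and action layers; the actual papers may fuse depth at the token level or train end to end.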