Advances in Vision-Language Navigation and Embodied Agents

Robotics and artificial intelligence research is advancing rapidly in vision-language navigation and embodied agents, with recent work focused on enabling robots to understand and execute high-level language instructions in complex environments. Researchers are applying large language models and vision-language models to navigation and control: grounded vision-language planning models generate action plans from combined visual and linguistic inputs, while teleoperation systems support remote driving and assistance of automated vehicles.

Noteworthy papers include Gondola, a grounded vision-language planning model for generalizable robotic manipulation; GRaD-Nav++, a lightweight Vision-Language-Action framework for real-time visual drone navigation; DyNaVLM, an end-to-end vision-language navigation framework that lets agents freely select navigation targets through visual-language reasoning; and Casper, an assistive teleoperation system that draws on the commonsense knowledge embedded in pre-trained vision-language models for real-time intent inference and flexible skill execution.
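To make the idea of language-model-driven navigation concrete, the sketch below shows a minimal agent loop that asks a vision-language model to pick the next discrete action from the current camera frame and an instruction. It is an illustrative skeleton under stated assumptions, not the method of Gondola, GRaD-Nav++, DyNaVLM, or any other paper cited here; the query_vlm wrapper, prompt format, and action set are hypothetical placeholders.

```python
# Minimal sketch of a VLM-in-the-loop navigation policy.
# The model call, prompt format, and action set are illustrative
# assumptions, not the approach of any paper cited above.

from dataclasses import dataclass
from typing import List

ACTIONS = ["move_forward", "turn_left", "turn_right", "stop"]

@dataclass
class Observation:
    image_path: str   # current camera frame (placeholder path)
    instruction: str  # high-level language instruction

def query_vlm(image_path: str, prompt: str) -> str:
    """Hypothetical wrapper around a vision-language model.

    In practice this would send the image and prompt to a multimodal
    model (local checkpoint or inference API) and return its text reply.
    Stubbed here so the sketch stays self-contained and runnable.
    """
    return "move_forward"

def choose_action(obs: Observation, history: List[str]) -> str:
    # Ground the instruction in the current view and the action history,
    # then ask the model to commit to exactly one discrete action.
    prompt = (
        f"Instruction: {obs.instruction}\n"
        f"Actions taken so far: {', '.join(history) or 'none'}\n"
        f"Choose the next action from {ACTIONS} and answer with that word only."
    )
    reply = query_vlm(obs.image_path, prompt).strip().lower()
    return reply if reply in ACTIONS else "stop"  # fall back safely on unexpected output

def navigate(obs: Observation, max_steps: int = 20) -> List[str]:
    history: List[str] = []
    for _ in range(max_steps):
        action = choose_action(obs, history)
        history.append(action)
        if action == "stop":
            break
        # A real agent would execute the action and refresh obs.image_path here.
    return history

if __name__ == "__main__":
    plan = navigate(Observation("frame_0.png", "go to the red chair in the kitchen"))
    print(plan)
```

In this simplified setup the model reasons directly over the observation and instruction at every step; the cited systems differ in how they ground plans (e.g., radiance-field representations in GRaD-Nav++ or free-form target selection in DyNaVLM), but the observe-query-act loop is the common pattern.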
Sources
GRaD-Nav++: Vision-Language Model Enabled Visual Drone Navigation with Gaussian Radiance Fields and Differentiable Dynamics
Narrate2Nav: Real-Time Visual Navigation with Implicit Language Reasoning in Human-Centric Environments