Advances in Vision-Language Navigation and Embodied Agents

The field of robotics and artificial intelligence is seeing significant advances in vision-language navigation and embodied agents. Recent work focuses on enabling robots to understand and execute high-level language instructions in complex environments, increasingly by applying large language models and vision-language models to navigation and control. One key innovation is grounded vision-language planning, in which a model generates executable plans from combined visual and linguistic inputs. Another active direction is teleoperation systems that enable remote driving and assistance of automated vehicles.

Several papers stand out. Gondola introduces a grounded vision-language planning model for generalizable robotic manipulation. GRaD-Nav++ presents a lightweight Vision-Language-Action framework for real-time visual drone navigation. DyNaVLM contributes an end-to-end vision-language navigation framework that lets agents freely select navigation targets through visual-language reasoning, backed by dynamic viewpoints and a self-refining graph memory. Casper introduces an assistive teleoperation system that leverages the commonsense knowledge embedded in pre-trained vision-language models for real-time intent inference and flexible skill execution.
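
To make the grounded-planning idea concrete, here is a minimal sketch of such an interface. Everything in it is a hypothetical stand-in: `query_vlm`, the `skill | object | bbox` output schema, and the prompt format are illustrative assumptions, not Gondola's actual design.

```python
from dataclasses import dataclass

@dataclass
class PlanStep:
    """One grounded action: a skill name plus the image region it refers to."""
    skill: str                       # e.g. "pick", "place"
    target: str                      # object phrase from the instruction
    bbox: tuple[int, int, int, int]  # (x1, y1, x2, y2) grounding in the image

def query_vlm(image_path: str, prompt: str) -> str:
    """Placeholder for a vision-language model call.

    A real system would send the image and prompt to a VLM and parse its
    reply; here we return a canned response so the sketch runs end to end.
    """
    return ("pick | red block | 120,80,180,140\n"
            "place | blue tray | 300,200,420,310")

def plan(image_path: str, instruction: str) -> list[PlanStep]:
    """Turn a language instruction plus a visual observation into grounded steps."""
    prompt = (
        f"Instruction: {instruction}\n"
        "List one action per line as: skill | object | x1,y1,x2,y2"
    )
    steps = []
    for line in query_vlm(image_path, prompt).splitlines():
        skill, target, box = (part.strip() for part in line.split("|"))
        steps.append(PlanStep(skill, target, tuple(int(v) for v in box.split(","))))
    return steps

if __name__ == "__main__":
    for step in plan("scene.png", "put the red block on the blue tray"):
        print(step)
```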

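In the same spirit, the dynamic-viewpoint loop behind a system like DyNaVLM can be sketched as a cycle of proposing candidate viewpoints, asking a model to pick one, and recording the move in a growing graph memory. `propose_viewpoints` and `choose_target` below are hypothetical placeholders (a real system would use depth sensing and visual-language reasoning over rendered views), not the paper's implementation.

```python
import random

def propose_viewpoints(pose):
    """Placeholder: sample candidate viewpoints around the current pose.

    A real system would derive candidates from depth or occupancy data;
    here we step to the four neighbors so the sketch runs.
    """
    x, y = pose
    return [(x + dx, y + dy) for dx, dy in ((1, 0), (0, 1), (-1, 0), (0, -1))]

def choose_target(goal, candidates, memory):
    """Placeholder for visual-language reasoning over candidate views.

    A real system would ask a VLM which candidate view best matches the
    goal; here we favor an unvisited candidate at random.
    """
    unvisited = [c for c in candidates if c not in memory] or candidates
    return random.choice(unvisited)

def navigate(goal, start, max_steps=10):
    """Grow a graph memory of viewpoints while steering toward the goal."""
    memory = {start: []}  # adjacency list: viewpoint -> neighboring viewpoints
    pose = start
    for _ in range(max_steps):
        candidates = propose_viewpoints(pose)
        target = choose_target(goal, candidates, memory)
        memory.setdefault(target, []).append(pose)  # refine graph with new edge
        memory[pose].append(target)
        pose = target
    return memory

if __name__ == "__main__":
    graph = navigate(goal="the kitchen table", start=(0, 0))
    print(f"visited {len(graph)} viewpoints")
```
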
Sources

Gondola: Grounded Vision Language Planning for Generalizable Robotic Manipulation

TUM Teleoperation: Open Source Software for Remote Driving and Assistance of Automated Vehicles

GRaD-Nav++: Vision-Language Model Enabled Visual Drone Navigation with Gaussian Radiance Fields and Differentiable Dynamics

Narrate2Nav: Real-Time Visual Navigation with Implicit Language Reasoning in Human-Centric Environments

Can Pretrained Vision-Language Embeddings Alone Guide Robot Navigation?

Casper: Inferring Diverse Intents for Assistive Teleoperation with Vision Language Models

HEAL: An Empirical Study on Hallucinations in Embodied Agents Driven by Large Language Models

DyNaVLM: Zero-Shot Vision-Language Navigation System with Dynamic Viewpoints and Self-Refining Graph Memory

FindingDory: A Benchmark to Evaluate Memory in Embodied Agents

Embodied Web Agents: Bridging Physical-Digital Realms for Integrated Agent Intelligence
