Vision-Language Navigation and Manipulation

The field of vision-language navigation and manipulation is advancing rapidly, with a focus on enabling autonomous robots to navigate unfamiliar environments and carry out complex tasks from natural language instructions. Recent research has explored large vision-language models, multimodal learning, and reinforcement learning as ways to improve the performance and generalization of these systems. One key direction is the development of more efficient and effective methods for vision-language grounding, which lets robots interpret and execute natural language instructions in complex environments. Another important area is the integration of vision-language models with robotic manipulation, allowing robots to perform tasks such as grasping and object manipulation from natural language descriptions.

Noteworthy papers in this area include:

- Following Route Instructions using Large Vision-Language Models: investigates off-the-shelf large vision-language models for vision-language navigation, comparing low-level and panoramic action spaces (see the first sketch after this list).
- Point2Act: proposes efficient 3D distillation of multimodal large language models for zero-shot context-aware grasping.
- Language as Cost: presents a framework for proactive hazard mapping with vision-language models for robot navigation (see the second sketch after this list).
- Enhancing Vision-Language Model Training with Reinforcement Learning: introduces a lightweight reinforcement learning algorithm for training vision-language models in synthetic worlds.
- INTENTION: proposes a framework that equips robots with learned interactive intuition for autonomous manipulation in diverse scenarios.
- MAG-Nav: presents a navigation framework built on off-the-shelf vision-language models, enhanced with perspective-based active grounding and historical memory backtracking.
- Analyzing the Impact of Multimodal Perception: examines the theoretical foundations of multimodal imitation learning.
- Learning to See and Act: proposes a framework for task-aware view planning for robotic manipulation.
- Information-Theoretic Graph Fusion: proposes a framework that enables dual-arm robotic systems to perform task-level reasoning and execution directly from RGB and depth human demonstrations.
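To make the route-following setting concrete, the first sketch below shows, in schematic form, how an off-the-shelf vision-language model can be prompted step by step with an instruction, an action history, and the current panoramic views, and asked to emit the next low-level action. This is a minimal illustration of the setup studied in Following Route Instructions using Large Vision-Language Models, not the paper's code: the `query_vlm` wrapper, the action set, and the prompt wording are assumptions made here for clarity.

```python
# Minimal sketch (illustrative, not released code): an off-the-shelf VLM picks the
# next low-level navigation action from the instruction and the current panorama.

from typing import List

LOW_LEVEL_ACTIONS = ["MOVE_FORWARD", "TURN_LEFT", "TURN_RIGHT", "STOP"]


def query_vlm(prompt: str, images: List[bytes]) -> str:
    """Hypothetical wrapper around whichever multimodal chat API is used."""
    raise NotImplementedError("Plug in your VLM client here.")


def next_action(instruction: str, history: List[str], panorama: List[bytes]) -> str:
    """Ask the VLM for the next low-level action given the route instruction."""
    prompt = (
        "You are guiding a robot through a building.\n"
        f"Instruction: {instruction}\n"
        f"Actions taken so far: {', '.join(history) or 'none'}\n"
        "The attached images are the current panoramic views "
        "(front, left, right, back).\n"
        f"Reply with exactly one of: {', '.join(LOW_LEVEL_ACTIONS)}."
    )
    reply = query_vlm(prompt, panorama).strip().upper()
    # Fall back to STOP if the model replies with anything outside the action space.
    return reply if reply in LOW_LEVEL_ACTIONS else "STOP"
```

In a panoramic action space, the same loop would instead ask the model to select one of the candidate viewpoints rather than a discrete motion primitive.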
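The second sketch illustrates the general "language as cost" idea: hazards named by a vision-language model (for example "wet floor") are converted into soft penalties on a 2D cost map that a standard planner can consume. The grid coordinates, severity scores, and Gaussian footprint are assumptions made for this illustration and are not taken from the paper's framework.

```python
# Minimal sketch: accumulate VLM-reported hazards into a planner-friendly cost map.
import numpy as np


def hazard_cost_map(
    grid_shape: tuple[int, int],
    hazards: list[tuple[float, float, float]],  # (row, col, severity in [0, 1])
    sigma_cells: float = 3.0,
) -> np.ndarray:
    """Place a Gaussian cost bump around each hazard location on the grid."""
    rows, cols = np.mgrid[0 : grid_shape[0], 0 : grid_shape[1]]
    cost = np.zeros(grid_shape, dtype=np.float32)
    for r, c, severity in hazards:
        dist2 = (rows - r) ** 2 + (cols - c) ** 2
        cost += severity * np.exp(-dist2 / (2.0 * sigma_cells**2))
    return np.clip(cost, 0.0, 1.0)


if __name__ == "__main__":
    # Two hazards reported by the VLM, already projected into grid coordinates.
    hazards = [(10.0, 12.0, 0.9), (25.0, 30.0, 0.5)]
    cost = hazard_cost_map((40, 40), hazards)
    print(cost.shape, float(cost.max()))  # (40, 40), with a peak close to 0.9
```

A navigation stack would add this layer to its usual obstacle costs, so the planner avoids VLM-flagged hazards proactively rather than only reacting to geometric obstacles.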
Sources
Following Route Instructions using Large Vision-Language Models: A Comparison between Low-level and Panoramic Action Spaces
Enhancing Vision-Language Model Training with Reinforcement Learning in Synthetic Worlds for Real-World Success
INTENTION: Inferring Tendencies of Humanoid Robot Motion Through Interactive Intuition and Grounded VLM