Vision-Language Navigation and Manipulation

The field of vision-language navigation and manipulation is advancing rapidly, with a focus on enabling autonomous robots to navigate unfamiliar environments and perform complex tasks from natural language instructions. Recent research explores large vision-language models, multimodal learning, and reinforcement learning to improve the performance and generalization of these systems. One key direction is the development of more efficient and effective methods for vision-language grounding, which lets robots interpret and execute natural language instructions in complex environments. Another is the integration of vision-language models with robotic manipulation, allowing robots to perform tasks such as grasping and object manipulation from natural language descriptions.

Noteworthy papers in this area include:

Following Route Instructions using Large Vision-Language Models investigates how well off-the-shelf large vision-language models follow route instructions for vision-language navigation, comparing low-level and panoramic action spaces.

Point2Act proposes efficient 3D distillation of multimodal large language models for zero-shot, context-aware grasping.

Language as Cost presents a framework for proactive hazard mapping with vision-language models for robot navigation; an illustrative sketch of this idea appears after this list.

Enhancing Vision-Language Model Training with Reinforcement Learning introduces a lightweight reinforcement learning algorithm for training vision-language models in synthetic worlds, targeting real-world success.

INTENTION proposes a framework that equips robots with learned interactive intuition for autonomous manipulation in diverse scenarios.

MAG-Nav presents a navigation framework built on off-the-shelf vision-language models, enhanced with perspective-based active grounding and historical memory backtracking.

Analyzing the Impact of Multimodal Perception examines how multimodal perception affects sample complexity and optimization landscapes in imitation learning.

Learning to See and Act proposes a framework for task-aware view planning in robotic manipulation.

Information-Theoretic Graph Fusion proposes a framework that enables dual-arm robotic systems to perform task-level reasoning and execution directly from RGB and depth human demonstrations.
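To make the "language as cost" idea concrete, here is a minimal sketch of how hazard judgments produced by a vision-language model could be folded into a planner's cost map. The VLM call is stubbed out (query_vlm_hazard_score is a hypothetical placeholder, not an API from the cited paper), and the grid layout and scores are illustrative assumptions rather than the paper's actual method.

```python
import numpy as np

# Hypothetical placeholder for a real VLM query: given a short text description
# of an image region, return a hazard score in [0, 1]. A real system would call
# an actual vision-language model here; this stub just looks up a few examples.
def query_vlm_hazard_score(region_description: str) -> float:
    scores = {"stairs": 0.9, "wet floor": 0.8, "clear hallway": 0.0}
    return scores.get(region_description, 0.1)

def build_cost_map(grid_shape, region_labels):
    """Convert per-cell language labels into an additive cost layer.

    region_labels maps (row, col) grid cells to short text descriptions
    produced from the robot's camera view; unlabeled cells default to zero cost.
    """
    cost = np.zeros(grid_shape, dtype=float)
    for (row, col), description in region_labels.items():
        cost[row, col] = query_vlm_hazard_score(description)
    return cost

# Example: a 4x4 occupancy grid where two cells were flagged as hazardous.
labels = {(0, 1): "stairs", (2, 2): "wet floor", (3, 0): "clear hallway"}
cost_map = build_cost_map((4, 4), labels)

# A standard grid planner (e.g. A*) can add this layer to its traversal cost
# so that planned routes steer away from language-identified hazards.
print(cost_map)
```

The point of the sketch is the design choice it illustrates: language, rather than geometry alone, determines which regions the planner treats as costly to traverse.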

Sources

Following Route Instructions using Large Vision-Language Models: A Comparison between Low-level and Panoramic Action Spaces

Point2Act: Efficient 3D Distillation of Multimodal LLMs for Zero-Shot Context-Aware Grasping

Language as Cost: Proactive Hazard Mapping using VLM for Robot Navigation

Enhancing Vision-Language Model Training with Reinforcement Learning in Synthetic Worlds for Real-World Success

INTENTION: Inferring Tendencies of Humanoid Robot Motion Through Interactive Intuition and Grounded VLM

MAG-Nav: Language-Driven Object Navigation Leveraging Memory-Reserved Active Grounding

Analyzing the Impact of Multimodal Perception on Sample Complexity and Optimization Landscapes in Imitation Learning

Learning to See and Act: Task-Aware View Planning for Robotic Manipulation

Information-Theoretic Graph Fusion with Vision-Language-Action Model for Policy Reasoning and Dual Robotic Control
