Advancements in Multimodal Interaction and Robot Navigation

The field of human-robot interaction and navigation is seeing significant advances, driven by the integration of multimodal sensing, large language models, and embodied cognition. Researchers are exploring approaches that enable robots to interpret ambiguous instructions, navigate complex environments, and interact with users more naturally and intuitively. One key direction is the development of frameworks that combine visual, auditory, and linguistic cues to improve robot perception and decision-making; another is the use of large language models to enhance robot navigation, object search, and human-robot interaction.

Noteworthy papers in this area include Take That for Me, which proposes a multimodal exophora resolution framework that leverages sound source localization and interactive questioning to help robots interpret ambiguous out-of-view instructions; An Embodied AR Navigation Agent, which integrates Building Information Modeling with a multi-agent retrieval-augmented generation framework to support flexible, language-driven goal retrieval and route planning; and DELIVER, a fully integrated framework for cooperative multi-robot pickup and delivery driven by natural language commands that achieves scalable, collision-free coordination in real-world settings.
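To make the geometric idea behind Voronoi-based multi-robot coordination concrete, the following is a minimal Python sketch, not DELIVER's actual planner: each pickup request is assigned to the robot whose Voronoi cell contains it, which is simply the nearest robot. The robot names, positions, and pickup coordinates are illustrative assumptions.

```python
# Minimal, illustrative sketch of Voronoi-based task assignment (hypothetical data,
# not the DELIVER implementation): each pickup point is handled by the robot whose
# Voronoi cell contains it, i.e. the robot closest to that point.
from math import dist

robots = {"r1": (0.0, 0.0), "r2": (10.0, 0.0), "r3": (5.0, 8.0)}  # robot base positions
pickups = [(1.0, 2.0), (9.0, 1.0), (5.0, 6.0), (4.0, 4.0)]        # requested pickup points


def assign_by_voronoi(robots, points):
    """Map each point to its nearest robot, i.e. the owner of the Voronoi cell it falls in."""
    assignment = {name: [] for name in robots}
    for p in points:
        owner = min(robots, key=lambda name: dist(robots[name], p))
        assignment[owner].append(p)
    return assignment


if __name__ == "__main__":
    for robot, tasks in assign_by_voronoi(robots, pickups).items():
        print(robot, "->", tasks)
```

A full relay planner would additionally hand parcels off at cell boundaries and enforce collision-free routing; this sketch only shows the partitioning step that such coordination builds on.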

Sources

Take That for Me: Multimodal Exophora Resolution with Interactive Questioning for Ambiguous Out-of-View Instructions

An Embodied AR Navigation Agent: Integrating BIM with Retrieval-Augmented Generation for Language Guidance

SonoCraftAR: Towards Supporting Personalized Authoring of Sound-Reactive AR Interfaces by Deaf and Hard of Hearing Users

Egocentric Instruction-oriented Affordance Prediction via Large Multimodal Model

DELIVER: A System for LLM-Guided Coordinated Multi-Robot Pickup and Delivery using Voronoi-Based Relay Planning

CapTune: Adapting Non-Speech Captions With Anchored Generative Models

Language-Enhanced Mobile Manipulation for Efficient Object Search in Indoor Environments
