Integrating Language and 3D Vision for Robotic Perception and Autonomy

The field of robotic perception and autonomy is shifting toward the integration of large language models (LLMs) with 3D vision. This convergence lets machines perceive, reason about, and interact with complex environments through natural language and spatial understanding. Recent work centers on scene understanding, text-to-3D generation, object grounding, and embodied agents. Notable advances include multimodal LLMs that fuse 3D data with other sensory inputs, such as touch and audio, to improve environmental comprehension and robotic decision-making. Work on cross-view geo-localization, 3D scene editing, and vision-language models further aims to improve the accuracy and robustness of robotic perception systems.
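To make the fusion idea concrete, below is a minimal sketch of the common cross-attention pattern for grounding language tokens in 3D point features. It is a hypothetical illustration, not taken from any of the cited papers; the module name, dimensions, and parameters are all assumptions.

```python
# Hypothetical sketch: cross-attention fusion of 3D point features with
# language tokens, a common pattern behind 3D-aware multimodal LLMs.
# Names and dimensions are illustrative, not any cited paper's design.
import torch
import torch.nn as nn

class PointLanguageFusion(nn.Module):
    def __init__(self, point_dim=256, text_dim=512, hidden_dim=256, num_heads=8):
        super().__init__()
        self.point_proj = nn.Linear(point_dim, hidden_dim)  # project 3D point features
        self.text_proj = nn.Linear(text_dim, hidden_dim)    # project language tokens
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)

    def forward(self, point_feats, text_feats):
        # point_feats: (B, N_points, point_dim); text_feats: (B, N_tokens, text_dim)
        q = self.text_proj(text_feats)         # language tokens act as queries
        kv = self.point_proj(point_feats)      # geometry supplies keys and values
        fused, _ = self.cross_attn(q, kv, kv)  # (B, N_tokens, hidden_dim)
        return fused                           # language tokens grounded in 3D
```

The design choice here is that language queries attend over geometry, so each token's representation absorbs the spatial context it refers to before being passed to an LLM backbone.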

Notable papers in this area include:

UniABG, a dual-stage unsupervised cross-view geo-localization framework that achieves state-of-the-art performance.

Part-X-MLLM, a native 3D multimodal large language model that unifies diverse 3D tasks, reaching state-of-the-art results in grounded Q&A, compositional generation, and localized editing.

SMGeo, a promptable, end-to-end transformer for cross-view object geo-localization with leading localization accuracy.

LEGO-SLAM, a framework for real-time, open-vocabulary mapping within a 3D Gaussian Splatting (3DGS)-based SLAM system, illustrated in the sketch below.
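As one concrete illustration of what open-vocabulary mapping enables, the sketch below shows how a text query might select Gaussians in a language-embedded map by cosine similarity. This is a hypothetical example of the general technique; the function, tensor names, and threshold are assumptions, not LEGO-SLAM's actual API.

```python
# Hypothetical sketch of open-vocabulary querying in a language-embedded
# Gaussian map: each Gaussian stores a CLIP-like feature, and a text query
# selects Gaussians whose features are similar to the query embedding.
import torch
import torch.nn.functional as F

def query_gaussians(gaussian_feats, text_feat, threshold=0.25):
    """gaussian_feats: (G, D) per-Gaussian language features;
    text_feat: (D,) embedding of a query phrase such as "red mug"."""
    sims = F.cosine_similarity(gaussian_feats, text_feat.unsqueeze(0), dim=-1)
    return (sims > threshold).nonzero(as_tuple=True)[0]  # indices of matches
```

The returned indices could then drive downstream behavior, such as rendering a semantic mask or steering a robot toward the matched region of the map.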

Sources

Large Language Models and 3D Vision for Intelligent Robotic Perception and Autonomy: A Review

UniABG: Unified Adversarial View Bridging and Graph Correspondence for Unsupervised Cross-View Geo-Localization

Part-X-MLLM: Part-aware 3D Multimodal Large Language Model

Error-Driven Scene Editing for 3D Grounding in Large Language Models

SMGeo: Cross-View Object Geo-Localization with Grid-Level Mixture-of-Experts

Let Language Constrain Geometry: Vision-Language Models as Semantic and Spatial Critics for 3D Generation

Semantic Glitch: Agency and Artistry in an Autonomous Pixel Cloud

LEGO-SLAM: Language-Embedded Gaussian Optimization SLAM
