The field of 3D visual grounding and scene understanding is advancing rapidly, with a focus on developing more effective methods for capturing semantic information from 3D scenes. Researchers are exploring vision-language models (VLMs) to extract diverse semantic attributes and relations from scenes and to build more accurate and robust 3D scene representations; a minimal sketch of this pattern appears after the paper list below. This has led to improvements in tasks such as semantic segmentation, 3D visual grounding, and object-centric mapping. Furthermore, integrating VLMs with other techniques, such as Gaussian Splatting and superpoint graphs, is enabling more efficient and effective scene understanding.

Noteworthy papers include:

- DSM, which proposes a diverse semantic map construction method for 3D visual grounding tasks.
- FMLGS, which presents an approach for part-level open-vocabulary querying within 3D Gaussian Splatting.
- FindAnything, which introduces an open-world mapping and exploration framework that incorporates vision-language information into dense volumetric submaps.
- Easy3D, which presents a simple yet effective method for interactive 3D segmentation.
- PARTFIELD, which proposes a feedforward approach for learning part-based 3D features.
- Object-Driven Narrative in AR, which explores integrating vision-language models into AR pipelines.
- Training-Free Hierarchical Scene Understanding for Gaussian Splatting with Superpoint Graphs, which introduces a training-free framework that constructs a superpoint graph directly from Gaussian primitives.
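To make the general VLM-driven pattern above concrete, here is a minimal, self-contained Python sketch of attaching VLM-derived attributes and simple spatial relations to an object-centric scene map. The `ObjectNode`, `query_vlm`, `annotate_object`, and `add_spatial_relations` names are illustrative assumptions, not APIs from any of the papers listed; a real system would replace `query_vlm` with an actual vision-language model call.

```python
from dataclasses import dataclass, field

@dataclass
class ObjectNode:
    """An object-centric entry in a 3D scene map (illustrative structure)."""
    object_id: int
    centroid: tuple                      # (x, y, z) in scene coordinates
    attributes: dict = field(default_factory=dict)
    relations: list = field(default_factory=list)

def query_vlm(image_crop, prompt: str) -> str:
    """Hypothetical stand-in for a vision-language model query on an object crop.
    A real pipeline would call an actual VLM here."""
    return "red, metallic, chair-shaped"  # placeholder response

def annotate_object(node: ObjectNode, image_crop) -> ObjectNode:
    """Attach open-vocabulary attributes returned by the VLM to an object node."""
    response = query_vlm(image_crop, "List the color, material, and category of this object.")
    node.attributes["open_vocab_description"] = response
    return node

def add_spatial_relations(nodes: list) -> None:
    """Derive a simple pairwise 'near' relation from centroid distance,
    as one example of the relations a semantic map might store."""
    for a in nodes:
        for b in nodes:
            if a.object_id != b.object_id:
                dist = sum((pa - pb) ** 2 for pa, pb in zip(a.centroid, b.centroid)) ** 0.5
                if dist < 1.0:           # 1 m threshold, arbitrary for illustration
                    a.relations.append(("near", b.object_id))

# Usage: annotate two object nodes and link nearby ones.
chair = annotate_object(ObjectNode(0, (0.2, 0.0, 0.5)), image_crop=None)
table = annotate_object(ObjectNode(1, (0.6, 0.1, 0.5)), image_crop=None)
add_spatial_relations([chair, table])
print(chair.attributes, chair.relations)
```

The resulting attribute-and-relation annotations are the kind of open-vocabulary semantics that diverse semantic maps and object-centric mapping frameworks aggregate over many views; the single-crop, single-threshold version here is only a schematic simplification.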