Advancements in 3D Scene Understanding and Vision-Language Models

The field of 3D scene understanding and vision-language models is advancing rapidly, with a focus on more efficient and effective methods for representing and reasoning about 3D scenes. Recent research has explored large language models, self-distillation techniques, and novel encoding schemes to improve performance on tasks such as 3D scene understanding, open-vocabulary dense prediction, and autonomous driving. Notably, incorporating 3D point cloud features and geometric cues has shown significant promise in helping vision-language models grasp 3D spatial structure. The development of large-scale benchmarks and datasets has further facilitated the evaluation and improvement of these models.

Several papers are particularly noteworthy: Pts3D-LLM proposes enriching visual tokens with 3D point cloud features; ATAS introduces a self-distillation method for enhancing semantic coherence and fine-grained alignment in vision-language models; and LEO-VL and Vireo contribute, respectively, an efficient scene representation for 3D vision-language understanding and a novel framework for open-vocabulary domain-generalized semantic segmentation.
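To make the token-enrichment idea concrete, here is a minimal sketch of one generic way visual tokens could be fused with per-token 3D point-cloud features. The shapes, the per-token pairing, and the concatenate-then-project fusion are all illustrative assumptions, not the specific method of Pts3D-LLM or any other paper above.

```python
import numpy as np

# Illustrative sketch (assumptions, not a paper's actual method):
# fuse per-token point-cloud features into visual tokens by
# concatenation followed by a linear projection.
rng = np.random.default_rng(0)

num_tokens = 16   # visual tokens from an image encoder
vis_dim = 32      # visual feature dimension
pts_dim = 8       # aggregated 3D point-cloud feature dimension

visual_tokens = rng.standard_normal((num_tokens, vis_dim))
# Assume each visual token has an associated point feature, e.g. from
# points that project into that token's image patch (an assumption here).
point_feats = rng.standard_normal((num_tokens, pts_dim))

# Concatenate along the feature axis, then project back to vis_dim so the
# enriched tokens keep the shape the downstream language model expects.
W = rng.standard_normal((vis_dim + pts_dim, vis_dim)) / np.sqrt(vis_dim + pts_dim)
fused = np.concatenate([visual_tokens, point_feats], axis=1) @ W

print(fused.shape)  # (16, 32): same token count and width as the input
```

The projection back to the original width is what lets the geometric signal ride along without changing the language model's input interface.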

Sources

Pts3D-LLM: Studying the Impact of Token Structure for 3D Scene Understanding With Large Language Models

ATAS: Any-to-Any Self-Distillation for Enhanced Open-Vocabulary Dense Prediction

SceneSplat++: A Large Dataset and Comprehensive Benchmark for Language Gaussian Splatting

SDTagNet: Leveraging Text-Annotated Navigation Maps for Online HD Map Construction

GLD-Road: A Global-Local Decoding Road Network Extraction Model for Remote Sensing Images

HSENet: Hybrid Spatial Encoding Network for 3D Medical Vision-Language Understanding

Leveraging Depth and Language for Open-Vocabulary Domain-Generalized Semantic Segmentation

3D-Aware Vision-Language Models Fine-Tuning with Geometric Distillation

LEO-VL: Towards 3D Vision-Language Generalists via Data Scaling with Efficient Representation

Using Language and Road Manuals to Inform Map Reconstruction for Autonomous Driving
