Research on 3D scene understanding and vision-language models is advancing rapidly, with recent work focused on more efficient and effective ways to represent and reason about 3D scenes. Recent studies leverage large language models, self-distillation techniques, and new scene-encoding methods to improve performance on 3D scene understanding, open-vocabulary dense prediction, and autonomous driving. In particular, incorporating 3D point cloud features and geometric cues has shown strong promise in helping vision-language models capture 3D spatial structure. In parallel, large-scale benchmarks and datasets have made it easier to evaluate and improve these models.

Several papers stand out. Pts3D-LLM enriches visual tokens with 3D point cloud features. ATAS introduces a self-distillation method that improves semantic coherence and fine-grained alignment in vision-language models. LEO-VL proposes an efficient scene representation for 3D vision-language understanding, and Vireo presents a framework for open-vocabulary domain-generalized semantic segmentation.
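To make the token-enrichment idea concrete, the sketch below shows one way per-patch 3D point cloud features could be fused into 2D visual tokens before they are passed to a language model. This is a minimal, assumption-based illustration, not the actual Pts3D-LLM architecture; the module name, dimensions, and additive fusion strategy are all hypothetical.

```python
import torch
import torch.nn as nn

class PointEnrichedTokens(nn.Module):
    """Minimal sketch: fuse per-patch 3D point features into 2D visual tokens.

    Illustrative only; the fusion strategy and dimensions are assumptions,
    not the published Pts3D-LLM design.
    """

    def __init__(self, vis_dim=1024, pts_dim=256, llm_dim=4096):
        super().__init__()
        self.pts_proj = nn.Linear(pts_dim, vis_dim)   # lift point features to the visual token width
        self.out_proj = nn.Linear(vis_dim, llm_dim)   # map fused tokens into the LLM embedding space

    def forward(self, vis_tokens, pts_feats):
        # vis_tokens: (B, N, vis_dim) patch tokens from a 2D vision encoder
        # pts_feats:  (B, N, pts_dim) point cloud features pooled per patch
        fused = vis_tokens + self.pts_proj(pts_feats)  # additive fusion keeps the token count unchanged
        return self.out_proj(fused)                    # tokens ready to prepend to the LLM input

# Usage: dummy features for a batch of 2 images with 196 patches each
tokens = PointEnrichedTokens()(torch.randn(2, 196, 1024), torch.randn(2, 196, 256))
print(tokens.shape)  # torch.Size([2, 196, 4096])
```

Additive fusion is just one choice here; cross-attention or token concatenation would serve the same purpose of injecting geometric cues into the visual token stream.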