The field of 3D perception and scene understanding is advancing rapidly, with a focus on efficient and accurate methods for tasks such as point cloud segmentation, 3D object detection, and instance segmentation. Recent work explores visual foundation models and 2D-centric pipelines to improve performance while reducing computational cost; in particular, applying pre-trained 2D models to 3D tasks has enabled fast and accurate predictions, and new frameworks and architectures have set state-of-the-art results on several benchmarks. Notable papers in this area include RangeSAM, which leverages visual foundation models for range-view represented LiDAR segmentation and achieves competitive performance on SemanticKITTI; Sparse Multiview Open-Vocabulary 3D Detection, which establishes a strong baseline for open-vocabulary 3D object detection in sparse-view settings; and SegDINO3D, which reaches state-of-the-art performance on 3D instance segmentation benchmarks by leveraging both image-level and object-level 2D features.
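The key step that lets 2D foundation models operate on LiDAR data, as in range-view pipelines like RangeSAM, is projecting the point cloud into a 2D range image via spherical projection. The sketch below illustrates the generic technique only; it is not code from any of the cited papers, and the function name, image resolution, and field-of-view values are illustrative assumptions (the FOV roughly matches the sensor used in SemanticKITTI).

```python
import numpy as np

def lidar_to_range_image(points, H=64, W=2048, fov_up_deg=3.0, fov_down_deg=-25.0):
    """Project an (N, 3) LiDAR point cloud onto an H x W range image
    via spherical projection. Resolution and FOV are assumed values
    for illustration, not taken from a specific paper."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points[:, :3], axis=1) + 1e-8

    yaw = np.arctan2(y, x)     # azimuth angle in [-pi, pi]
    pitch = np.arcsin(z / r)   # elevation angle

    fov_up = np.deg2rad(fov_up_deg)
    fov_down = np.deg2rad(fov_down_deg)
    fov = fov_up - fov_down

    # Map angles to pixel coordinates: azimuth -> column, elevation -> row.
    u = 0.5 * (1.0 - yaw / np.pi) * W
    v = (1.0 - (pitch - fov_down) / fov) * H

    u = np.clip(np.floor(u), 0, W - 1).astype(np.int32)
    v = np.clip(np.floor(v), 0, H - 1).astype(np.int32)

    # Sort points farthest-first so the closest point wins when
    # several points fall into the same pixel.
    order = np.argsort(r)[::-1]
    range_image = np.full((H, W), -1.0, dtype=np.float32)
    range_image[v[order], u[order]] = r[order]
    return range_image

# Example: project 100k synthetic points scattered around the sensor.
pts = (np.random.rand(100_000, 3) - 0.5) * 100.0
img = lidar_to_range_image(pts)
print(img.shape)  # (64, 2048)
```

The resulting H x W range image can be fed to a 2D segmentation backbone, and per-pixel predictions can then be mapped back to the original points using the same (v, u) indices.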