Research in 3D scene understanding is moving toward representations that combine semantic richness with geometric detail. This shift is driven by new datasets and annotation pipelines that enable dense captioning of scene elements and high-level question generation, making downstream tasks such as vision-language navigation and interactive question answering more effective. Noteworthy papers in this area include DenseScan, which introduces a dataset with detailed multi-level descriptions; LISA-3D, which lifts language-image segmentation into 3D via multi-view consistency; SpatialReasoner, an active perception framework that autonomously invokes spatial tools to explore 3D scenes based on textual queries; and DepthScape, which facilitates 2.5D effect creation by placing design elements directly into 3D reconstructions.