The field of multimodal scene understanding is moving toward more comprehensive and integrated approaches that combine multiple modalities and sensors to better comprehend the physical world. This trend is driven by the limitations of traditional single-modality methods and by the growing availability of multimodal data. Recent work focuses on frameworks that can effectively fuse and represent modalities such as visual, auditory, and sensor data, enabling more accurate and robust scene understanding. Noteworthy papers have introduced new methods for joint embedding spaces, modality modeling, and incomplete multimodal learning, with promising results in applications including emotion recognition, outdoor scene understanding, and image retrieval. Notable contributions include GT-Loc, which jointly predicts an image's capture time and geo-location through a unified embedding space, and City-VLM, which introduces a multidomain perception dataset and model for outdoor scene understanding.
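Since unified embedding spaces recur across these papers, the sketch below illustrates one common pattern: projecting image, timestamp, and location features into a shared space and aligning them with a symmetric contrastive loss. This is a minimal illustration of the general idea, not the GT-Loc architecture; the module names, feature dimensions, and cyclic time/location encodings are all assumptions.

```python
# Minimal sketch (not the GT-Loc implementation): aligning image features with
# timestamp and geo-location embeddings in one shared space via contrastive losses.
# All names, dimensions, and encoders below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class JointEmbeddingModel(nn.Module):
    """Projects image, timestamp, and GPS features into a shared embedding space."""

    def __init__(self, img_dim=512, embed_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, embed_dim)        # image branch
        self.time_proj = nn.Sequential(                      # sin/cos of hour-of-day, day-of-year
            nn.Linear(4, 128), nn.ReLU(), nn.Linear(128, embed_dim)
        )
        self.geo_proj = nn.Sequential(                       # sin/cos of latitude, longitude
            nn.Linear(4, 128), nn.ReLU(), nn.Linear(128, embed_dim)
        )
        self.logit_scale = nn.Parameter(torch.tensor(2.65))  # learnable log-temperature

    def forward(self, img_feats, time_feats, geo_feats):
        # L2-normalize so similarity in the shared space is cosine similarity.
        img = F.normalize(self.img_proj(img_feats), dim=-1)
        tim = F.normalize(self.time_proj(time_feats), dim=-1)
        geo = F.normalize(self.geo_proj(geo_feats), dim=-1)
        return img, tim, geo


def contrastive_loss(a, b, logit_scale):
    """Symmetric InfoNCE loss between two batches of paired embeddings."""
    logits = logit_scale.exp() * a @ b.t()
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    model = JointEmbeddingModel()
    batch = 8
    img_feats = torch.randn(batch, 512)   # e.g. features from a frozen vision backbone
    time_feats = torch.randn(batch, 4)    # cyclically encoded capture time
    geo_feats = torch.randn(batch, 4)     # cyclically encoded latitude/longitude
    img, tim, geo = model(img_feats, time_feats, geo_feats)
    # Pull each image toward its own timestamp and location in the shared space.
    loss = contrastive_loss(img, tim, model.logit_scale) + contrastive_loss(img, geo, model.logit_scale)
    loss.backward()
    print(f"toy joint-embedding loss: {loss.item():.3f}")
```

At inference time, a shared space like this supports both tasks by nearest-neighbor search: an image embedding is compared against a gallery of time embeddings for timestamp prediction and against location embeddings for geo-localization.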