The field of multimodal understanding is moving toward stronger spatial reasoning and culturally grounded understanding. Recent work focuses on integrating spatial features with multimodal embeddings to improve visual spatial reasoning, alongside a growing emphasis on datasets and models for specialized cultural heritage domains and low-resource languages. Noteworthy papers include Spatial-ViLT, which introduces a multi-task learning framework to enhance visual spatial reasoning, and EverydayMMQA, a framework for building large-scale, culturally grounded datasets for spoken and visual question answering. Other notable contributions are the VaseVQA-3D dataset for ancient Greek pottery analysis and the VLCAP framework for Arabic image captioning.
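To make the idea of fusing spatial features with multimodal embeddings concrete, the sketch below shows a minimal multi-task head that combines a vision-language embedding with auxiliary spatial cues (e.g., depth statistics or bounding-box geometry) and trains a VQA answer head jointly with a spatial-relation head. It is an illustrative sketch only: the class, parameter names, and dimensions are assumptions, and it does not reproduce the Spatial-ViLT architecture.

```python
# Hypothetical sketch of multi-task spatial fusion (not the Spatial-ViLT implementation).
import torch
import torch.nn as nn

class SpatialFusionVQA(nn.Module):
    """Fuses a vision-language embedding with auxiliary spatial features
    and predicts both a VQA answer and a spatial-relation label."""

    def __init__(self, vl_dim=768, spatial_dim=64, hidden=512,
                 num_answers=3000, num_relations=9):
        super().__init__()
        # Project auxiliary spatial cues (e.g., depth stats, box geometry).
        self.spatial_proj = nn.Sequential(
            nn.Linear(spatial_dim, hidden), nn.ReLU()
        )
        # Shared trunk over the concatenated representation.
        self.fusion = nn.Sequential(
            nn.Linear(vl_dim + hidden, hidden), nn.ReLU()
        )
        # Task-specific heads trained jointly (multi-task learning).
        self.answer_head = nn.Linear(hidden, num_answers)
        self.relation_head = nn.Linear(hidden, num_relations)

    def forward(self, vl_embedding, spatial_features):
        s = self.spatial_proj(spatial_features)
        h = self.fusion(torch.cat([vl_embedding, s], dim=-1))
        return self.answer_head(h), self.relation_head(h)

# Joint objective: main VQA loss plus a weighted auxiliary spatial-relation loss.
model = SpatialFusionVQA()
vl_emb = torch.randn(4, 768)    # stand-in for a frozen VLM embedding
spatial = torch.randn(4, 64)    # stand-in for extracted spatial features
answers, relations = model(vl_emb, spatial)
loss = nn.functional.cross_entropy(answers, torch.randint(0, 3000, (4,))) \
     + 0.5 * nn.functional.cross_entropy(relations, torch.randint(0, 9, (4,)))
loss.backward()
```

The design choice illustrated here is that the auxiliary spatial task shares the fusion trunk with the main task, so gradients from the spatial-relation objective shape the shared representation used for answering.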