The field of multimodal learning is moving toward a more comprehensive understanding of human behavior and emotion, with a growing emphasis on non-verbal cues and multimodal interaction. Recent work highlights the importance of mutual guidance between the text and image modalities for capturing intention-related representations. There is also increasing interest in more robust and efficient methods for cross-modal retrieval and image captioning, particularly in low-resource languages, and optimal transport-based distance measures and vision-free retrieval pipelines are being explored to improve the accuracy and privacy of multimodal models.

Noteworthy papers in this area include:
- PCSR, which introduces a framework for enhancing correspondence reliability in cross-modal retrieval.
- RACap, which proposes a relation-aware retrieval-augmented model for image captioning.
- OTCCLIP, which reconstructs image-caption pairs with an optimal transport-based framework to defend against data poisoning.
- LexiCLIP, which introduces a vision-free retrieval pipeline achieving state-of-the-art performance on multiple retrieval and compositionality benchmarks.
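To make the optimal transport idea concrete, the sketch below computes an entropically regularized (Sinkhorn) matching cost between a set of text token embeddings and a set of image patch embeddings. This is a generic illustration of an OT-based cross-modal distance, not the specific formulation used in OTCCLIP or any of the papers above; the function name `sinkhorn_distance`, the cosine cost, the uniform marginals, and the regularization weight `eps` are all illustrative choices.

```python
import numpy as np

def sinkhorn_distance(text_emb, image_emb, eps=0.05, n_iters=100):
    """Entropic-regularized OT cost between two sets of embeddings.

    text_emb:  (n, d) array of text token embeddings (illustrative).
    image_emb: (m, d) array of image patch embeddings (illustrative).
    Returns the transport cost under uniform marginals; lower = better aligned.
    """
    # Cosine cost matrix: 1 - similarity between every token/patch pair.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    C = 1.0 - t @ v.T                      # (n, m) pairwise costs

    n, m = C.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)   # uniform marginals
    K = np.exp(-C / eps)                   # Gibbs kernel

    # Sinkhorn iterations: alternately rescale rows and columns.
    u = np.ones(n)
    for _ in range(n_iters):
        v_scale = b / (K.T @ u)
        u = a / (K @ v_scale)

    P = np.diag(u) @ K @ np.diag(v_scale)  # transport plan
    return float(np.sum(P * C))            # OT matching cost

# Toy usage: an aligned image-caption pair should score lower than a mismatched one.
rng = np.random.default_rng(0)
txt = rng.normal(size=(12, 64))
img_aligned = txt[:10] + 0.1 * rng.normal(size=(10, 64))
img_random = rng.normal(size=(10, 64))
print(sinkhorn_distance(txt, img_aligned), sinkhorn_distance(txt, img_random))
```

Under this kind of measure, a poisoned or mismatched image-caption pair yields a high transport cost, which is the general intuition behind using OT distances to filter or reconstruct training pairs.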