Multimodal Learning for Enhanced Understanding

The field of multimodal learning is moving towards a more comprehensive understanding of human behavior and emotions, with a focus on incorporating non-verbal cues and multimodal interactions. Recent research has highlighted the importance of mutual guidance between the text and image modalities for capturing intention-related representations. There is also growing interest in more robust and efficient methods for cross-modal retrieval and image captioning, particularly for low-resource languages. Optimal transport-based distance measures and vision-free retrieval pipelines are being explored to improve the accuracy and privacy of multimodal models. Noteworthy papers in this area include PCSR, which introduces a novel framework for enhancing correspondence reliability in cross-modal retrieval; RACap, which proposes a relation-aware retrieval-augmented model for image captioning; OTCCLIP, which reconstructs image-caption pairs using an optimal transport-based framework to defend against data poisoning; and LexiCLIP, which introduces a vision-free retrieval pipeline that achieves state-of-the-art performance on multiple retrieval and compositionality benchmarks.
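To make the idea of an optimal transport-based distance between modalities concrete, here is a minimal sketch of entropy-regularized optimal transport (Sinkhorn iterations) between a set of image-patch embeddings and a set of caption-token embeddings. It is a generic illustration and not the method of OTCCLIP or any other paper listed here; the function name `sinkhorn_distance`, the cosine cost, the uniform weights, and the values of `eps` and `n_iters` are all assumptions chosen for the example.

```python
import numpy as np

def sinkhorn_distance(x, y, eps=0.1, n_iters=200):
    """Entropic OT cost between two embedding sets (hypothetical helper).

    x: (n, d) array, e.g. image-patch embeddings.
    y: (m, d) array, e.g. caption-token embeddings.
    Returns the transport cost and the (n, m) transport plan.
    """
    # L2-normalize rows so the cost below is a cosine distance in [0, 2].
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    cost = 1.0 - x @ y.T

    n, m = cost.shape
    a = np.full(n, 1.0 / n)  # uniform mass on each patch (assumption)
    b = np.full(m, 1.0 / m)  # uniform mass on each token (assumption)

    # Gibbs kernel; Sinkhorn alternately rescales rows and columns
    # so that the plan's marginals approach a and b.
    K = np.exp(-cost / eps)
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)

    plan = u[:, None] * K * v[None, :]
    return float((plan * cost).sum()), plan


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    patches = rng.normal(size=(6, 32))  # stand-in for image-patch features
    tokens = rng.normal(size=(4, 32))   # stand-in for caption-token features
    dist, plan = sinkhorn_distance(patches, tokens)
    print("OT distance:", dist)
    print("plan shape:", plan.shape)
```

A lower value means the two sets can be matched at low cost, so a score of this kind could in principle be used to rank image-caption pairs or to flag pairs that align poorly. The random features above are placeholders; a real pipeline would use embeddings from a vision and a text encoder.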

Sources

Beyond Words: Enhancing Desire, Emotion, and Sentiment Recognition with Non-Verbal Cues

PCSR: Pseudo-label Consistency-Guided Sample Refinement for Noisy Correspondence Learning

RACap: Relation-Aware Prompting for Lightweight Retrieval-Augmented Image Captioning

Align Where the Words Look: Cross-Attention-Guided Patch Alignment with Contrastive and Transport Regularization for Bengali Captioning

What Makes You Unique? Attribute Prompt Composition for Object Re-Identification

Pre-training CLIP against Data Poisoning with Optimal Transport-based Matching and Alignment

Vision-Free Retrieval: Rethinking Multimodal Search with Textual Scene Descriptions

Long Story Short: Disentangling Compositionality and Long-Caption Understanding in VLMs
