Multimodal Representation Learning

The field of multimodal representation learning is moving toward more efficient and effective methods for learning semantic embeddings. Recent work focuses on novel optimization frameworks, pre-training paradigms, and demonstration selection methods that improve the performance of vision-language models (VLMs). Notably, researchers are exploring ways to decouple complementary objectives in contrastive learning so that multiple tasks can be optimized simultaneously, and there is growing interest in more accurate and efficient estimation of the normalization term in contrastive loss functions. Together, these advances improve performance across a range of vision-language tasks, including cross-modal retrieval, clustering, and classification. Noteworthy papers include NeuCLIP, which proposes an optimization framework for efficient large-scale CLIP training; Compression then Matching (CoMa), which introduces a compressed pre-training phase that turns VLMs into competitive embedding models and achieves new state-of-the-art results among VLMs of comparable size on the MMEB benchmark; and Efficient and Effective In-context Demonstration Selection with Coreset, which significantly improves in-context learning performance over existing selection strategies.
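
To make the "normalization term" concrete, the sketch below shows a standard CLIP-style symmetric contrastive loss in PyTorch: the per-sample normalizer is the log-sum-exp over all in-batch candidates, which is exactly the quantity that large-scale training methods such as NeuCLIP aim to estimate more accurately and efficiently. This is a minimal illustration of the baseline loss, not NeuCLIP's actual algorithm; the function name, shapes, and temperature value are assumptions for the example.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over an in-batch similarity matrix.

    The normalizer for each sample is logsumexp(logits, dim=-1),
    computed implicitly inside cross_entropy via log-softmax.
    """
    # L2-normalize so dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (B, B) similarity matrix; the diagonal holds the positive pairs.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # cross_entropy subtracts the per-row normalizer (log-sum-exp over
    # all candidates in the batch) from the positive-pair logit.
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)

# Example usage with random embeddings.
if __name__ == "__main__":
    img = torch.randn(8, 512)
    txt = torch.randn(8, 512)
    print(clip_contrastive_loss(img, txt).item())
```

Because this normalizer couples every sample in the batch, computing it exactly requires very large batches; methods that approximate or learn it are what allow efficient large-scale CLIP training.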

Sources

GSE: Evaluating Sticker Visual Semantic Similarity via a General Sticker Encoder

NeuCLIP: Efficient Large-Scale CLIP Training with Neural Normalizer Optimization

Compression then Matching: An Efficient Pre-training Paradigm for Multimodal Embedding

Efficient and Effective In-context Demonstration Selection with Coreset
