Multimodal Information Retrieval and Representation Learning

The field of multimodal information retrieval and representation learning is moving toward more robust and efficient methods for handling diverse types of data. Researchers are improving multimodal models through weak supervision, multimodal fusion, and cross-modal alignment techniques, with growing interest in leveraging pre-trained encoders and knowledge distillation to strengthen generalization and the diversity of generated results. The development of new datasets and evaluation metrics is further accelerating progress. Noteworthy papers include:

  • FemmIR, a framework that retrieves multimodal results relevant to information needs expressed as multimodal query-by-example, without requiring any similarity labels (see the edit-distance sketch after this list).
  • TRIDENT, a novel framework that integrates molecular SMILES, textual descriptions, and taxonomic functional annotations to learn rich molecular representations, achieving state-of-the-art performance on 11 downstream tasks.
  • DALR, a dual-level alignment learning approach for multimodal sentence representation learning that addresses cross-modal misalignment bias and intra-modal semantic divergence, outperforming state-of-the-art baselines (a minimal alignment sketch follows this list).
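
FemmIR's use of edit distance as weak supervision can be pictured with a small sketch: a weak relevance score between two items is derived by comparing textual surrogates of them, rather than from human-annotated similarity labels. The attribute-string representation and the length normalization below are illustrative assumptions, not details taken from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def weak_similarity(desc_a: str, desc_b: str) -> float:
    """Map edit distance to a [0, 1] score usable as a weak
    training label in place of annotated relevance judgments."""
    if not desc_a and not desc_b:
        return 1.0
    return 1.0 - levenshtein(desc_a, desc_b) / max(len(desc_a), len(desc_b))

# Example: hypothetical attribute strings extracted from two multimodal items.
print(weak_similarity("red sedan, 4 doors", "red sedan, 2 doors"))  # ~0.94
```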

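Likewise, the alignment objectives underlying TRIDENT and DALR can be viewed as contrastive losses over a shared embedding space. The sketch below shows a generic symmetric InfoNCE alignment between the outputs of two pre-trained encoders; the projection dimensions, temperature, and module names are assumptions for illustration, not the papers' actual architectures.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAligner(nn.Module):
    """Projects two modalities into a shared space and aligns them with
    a symmetric InfoNCE loss. Dimensions are illustrative placeholders."""

    def __init__(self, dim_a=768, dim_b=512, dim_shared=256, temperature=0.07):
        super().__init__()
        self.proj_a = nn.Linear(dim_a, dim_shared)  # e.g. text encoder output
        self.proj_b = nn.Linear(dim_b, dim_shared)  # e.g. image/molecule encoder output
        self.temperature = temperature

    def forward(self, feats_a, feats_b):
        # L2-normalize so the dot product is a cosine similarity.
        za = F.normalize(self.proj_a(feats_a), dim=-1)
        zb = F.normalize(self.proj_b(feats_b), dim=-1)
        logits = za @ zb.t() / self.temperature
        targets = torch.arange(za.size(0), device=za.device)
        # Symmetric loss: paired items on the diagonal are positives.
        return (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.t(), targets)) / 2

# Usage with random stand-in features for a batch of 8 pairs.
aligner = CrossModalAligner()
loss = aligner(torch.randn(8, 768), torch.randn(8, 512))
```
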
Sources

Multimodal Information Retrieval for Open World with Edit Distance Weak Supervision

On the Burstiness of Faces in Set

Semantic-enhanced Modality-asymmetric Retrieval for Online E-commerce Search

A Multi-Stage Framework for Multimodal Controllable Speech Synthesis

TRIDENT: Tri-Modal Molecular Representation Learning with Taxonomic Annotations and Local Correspondence

DALR: Dual-level Alignment Learning for Multimodal Sentence Representation Learning

Global and Local Entailment Learning for Natural World Imagery

Maximal Matching Matters: Preventing Representation Collapse for Robust Cross-Modal Retrieval
