The field of multimodal information retrieval and representation learning is moving toward more robust and efficient methods for handling diverse data types. Researchers are improving multimodal models through weak supervision, multimodal fusion, and alignment techniques, with growing interest in leveraging pre-trained encoders and knowledge distillation to improve generalization and the diversity of generated results. The development of new datasets and evaluation metrics is further accelerating progress. Noteworthy papers include:
- FemmIR, a framework that retrieves multimodal results relevant to information needs expressed as multimodal query-by-example, without requiring any similarity labels.
- TRIDENT, a novel framework that integrates molecular SMILES strings, textual descriptions, and taxonomic functional annotations to learn rich molecular representations, achieving state-of-the-art performance on 11 downstream tasks.
- DALR, a dual-level alignment learning approach for multimodal sentence representation learning that addresses cross-modal misalignment bias and intra-modal semantic divergence, outperforming state-of-the-art baselines (a generic sketch of a dual-level objective follows the list).
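
To make the "dual-level alignment" idea concrete, below is a minimal, hypothetical sketch of such an objective: a cross-modal contrastive (InfoNCE) term that pulls paired embeddings together, plus an intra-modal term that matches similarity structure within each modality. The function name, weighting, and the specific intra-modal formulation are illustrative assumptions, not the DALR paper's actual loss.

```python
# A minimal, generic sketch of a dual-level alignment objective
# (cross-modal contrastive term + intra-modal consistency term).
# Illustrative assumption only -- NOT the DALR paper's actual loss.
import torch
import torch.nn.functional as F

def dual_level_alignment_loss(img_emb, txt_emb, temperature=0.07, intra_weight=0.5):
    """img_emb, txt_emb: (batch, dim) embeddings of paired inputs."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)

    # Cross-modal level: symmetric InfoNCE over the in-batch similarity
    # matrix, where matching (i, i) pairs are the positives.
    logits = img @ txt.t() / temperature
    targets = torch.arange(img.size(0), device=img.device)
    cross = 0.5 * (F.cross_entropy(logits, targets) +
                   F.cross_entropy(logits.t(), targets))

    # Intra-modal level: encourage the text-text similarity structure to
    # mirror the image-image structure (one plausible intra-modal choice).
    sim_img = img @ img.t()
    sim_txt = txt @ txt.t()
    intra = F.mse_loss(sim_txt, sim_img)

    return cross + intra_weight * intra

if __name__ == "__main__":
    # Random embeddings stand in for real pre-trained encoder outputs.
    img_emb = torch.randn(8, 256)
    txt_emb = torch.randn(8, 256)
    print(dual_level_alignment_loss(img_emb, txt_emb))
```

In practice the two levels would operate on outputs of pre-trained modality encoders, and the intra-modal term shown here is just one of several reasonable choices for reducing intra-modal semantic divergence.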