Multimodal Representation Learning and Retrieval

The field of multimodal representation learning and retrieval is advancing rapidly, with a focus on more efficient and effective methods for aligning and retrieving data across modalities. Recent work emphasizes representation alignment, using techniques such as linear transformations and attention-based mechanisms to bridge the gap between modalities. There is also a growing trend toward unified, scalable models that handle multiple modalities and tasks simultaneously.

Notable papers in this area include mini-vec2vec, which presents a simple and efficient alternative to the original vec2vec method for aligning text embedding spaces without parallel data; Omni-Embed-Nemotron, which introduces a unified multimodal retrieval embedding model covering text, image, audio, and video; Guided Query Refinement, which proposes a test-time optimization method that refines a primary retriever's query embedding using guidance from a complementary retriever's scores; and Efficient Discriminative Joint Encoders, which precomputes vision tokens offline and compresses them via a lightweight attention-based adapter, enabling high-throughput inference.
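To make the linear-transformation alignment idea concrete, here is a minimal sketch of the classic orthogonal Procrustes step: given two sets of embeddings that are assumed to be (pseudo-)paired, solve for the orthogonal map that best aligns one space to the other. This is only an illustrative building block, not the actual mini-vec2vec pipeline (which constructs such pairs without parallel data); all names below are assumptions for the example.

```python
import numpy as np

def fit_linear_alignment(X, Y):
    """Fit an orthogonal map W minimizing ||X W - Y||_F (orthogonal Procrustes).

    X, Y: (n, d) arrays of (pseudo-)paired embeddings from two spaces.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Toy check: recover a known orthogonal rotation between two embedding spaces.
rng = np.random.default_rng(0)
W_true, _ = np.linalg.qr(rng.normal(size=(16, 16)))  # random orthogonal map
X = rng.normal(size=(200, 16))                       # source embeddings
Y = X @ W_true                                       # target embeddings
W = fit_linear_alignment(X, Y)
```

Because the solution is constrained to be orthogonal, it preserves distances and angles in the aligned space, which is why such linear maps are a popular, cheap baseline for embedding-space alignment.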
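The test-time query refinement idea can likewise be sketched in a few lines. The toy below blends the primary retriever's scores with a complementary retriever's scores into a target distribution, then gradient-steps the primary query embedding so its own scores reproduce that blend. This is a hedged illustration of the general mechanism, not Guided Query Refinement's exact objective; the blending scheme, names, and hyperparameters are assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def refine_query(q, docs, comp_scores, alpha=1.0, lr=0.5, steps=200):
    # Target distribution blends the primary scores (docs @ q) with the
    # complementary retriever's scores; we then nudge q so the primary
    # scores alone reproduce that blend (cross-entropy gradient descent).
    target = softmax(docs @ q + alpha * comp_scores)
    for _ in range(steps):
        p = softmax(docs @ q)
        q = q - lr * (docs.T @ (p - target))  # grad of CE(target, p) wrt q
    return q

docs = np.eye(3, 4)                 # toy primary document embeddings (3 docs, d=4)
q = np.array([2.0, 0.0, 0.0, 0.0])  # initial query favors doc 0
comp = np.array([0.0, 5.0, 0.0])    # complementary retriever favors doc 1
q_refined = refine_query(q, docs, comp)
```

In this toy setup the refined query's top-ranked document flips from doc 0 to doc 1, showing how guidance from the second retriever reshapes the primary ranking without retraining either model.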

Sources

mini-vec2vec: Scaling Universal Geometry Alignment with Linear Transformations

Team Xiaomi EV-AD VLA: Caption-Guided Retrieval System for Cross-Modal Drone Navigation - Technical Report for IROS 2025 RoboSense Challenge Track 4

Align Your Query: Representation Alignment for Multimodality Medical Object Detection

Omni-Embed-Nemotron: A Unified Multimodal Retrieval Model for Text, Image, Audio, and Video

Learning-Based Hashing for ANN Search: Foundations and Early Advances

Compressed Concatenation of Small Embedding Models

Guided Query Refinement: Multimodal Hybrid Retrieval with Test-Time Optimization

multimodars: A Rust-powered toolkit for multi-modality cardiac image fusion and registration

CoT Referring: Improving Referring Expression Tasks with Grounded Reasoning

Efficient Discriminative Joint Encoders for Large Scale Vision-Language Reranking
