Advances in Text Embeddings and Information Retrieval

Research on text embeddings and information retrieval is converging on methods that are both more effective and more efficient. Recent work focuses on improving large language models and embedding models, particularly for low-resource languages and domains. Approaches such as in-context learning and label distribution learning show promising results for predicting annotator-specific annotations and for producing soft labels that preserve annotator disagreement. New resources and models have also been introduced to support Dutch embeddings and to strengthen retrieval models.

Noteworthy papers include Conan-Embedding-v2, which reports state-of-the-art results on the Massive Text Embedding Benchmark (MTEB) with a new training methodology; zELO, an ELO-inspired training method for rerankers and embedding models that optimizes retrieval performance; and Hashing-Baseline, a strong training-free hashing method built on powerful pretrained encoders. The short sketches below illustrate the soft-label, Elo-rating, and training-free hashing ideas under stated assumptions.
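To make the soft-label idea concrete, the following is a minimal sketch of a label distribution learning loss: the model's predicted distribution is pulled toward the empirical distribution of annotator votes via KL divergence. The tensor names and vote counts are illustrative assumptions, not taken from the DeMeVa paper.

```python
import torch
import torch.nn.functional as F

def soft_label_loss(logits: torch.Tensor, vote_counts: torch.Tensor) -> torch.Tensor:
    """KL divergence between the predicted class distribution and the
    per-item annotator vote distribution (the 'soft label')."""
    target = vote_counts / vote_counts.sum(dim=-1, keepdim=True)  # normalize votes
    log_probs = F.log_softmax(logits, dim=-1)
    return F.kl_div(log_probs, target, reduction="batchmean")

# Hypothetical example: 2 items, 3 classes; annotators disagree on item 2.
logits = torch.randn(2, 3, requires_grad=True)
votes = torch.tensor([[4.0, 1.0, 0.0],   # near-consensus on class 0
                      [2.0, 2.0, 1.0]])  # genuine disagreement
loss = soft_label_loss(logits, votes)
loss.backward()
```

Unlike cross-entropy against a single majority label, this objective rewards the model for reproducing the full spread of annotator perspectives.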
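zELO's actual training procedure is described in the paper; the sketch below shows only the classic Elo update it is named after, applied to pairwise document-preference judgments so that each document ends up with a scalar rating that could, for example, serve as a regression target for a reranker. All identifiers and constants here are assumptions for illustration.

```python
from collections import defaultdict

def elo_ratings(pairs, k=32.0, base=400.0, init=1000.0):
    """pairs: iterable of (winner_id, loser_id) preference judgments."""
    rating = defaultdict(lambda: init)
    for winner, loser in pairs:
        # Expected probability that the current winner would win this comparison.
        expected_win = 1.0 / (1.0 + 10 ** ((rating[loser] - rating[winner]) / base))
        delta = k * (1.0 - expected_win)
        rating[winner] += delta  # winner gains rating
        rating[loser] -= delta   # loser loses the same amount
    return dict(rating)

# Hypothetical judgments: "d1" preferred over "d2" twice, "d2" over "d3" once.
print(elo_ratings([("d1", "d2"), ("d1", "d2"), ("d2", "d3")]))
```

Upsets (a low-rated document beating a high-rated one) move ratings more than expected outcomes, which is what makes Elo-style aggregation of noisy pairwise preferences attractive.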
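For a sense of what training-free hashing can look like, this sketch binarizes embeddings from a pretrained encoder with a fixed random rotation and retrieves by Hamming distance. The random arrays stand in for real encoder outputs, and the exact Hashing-Baseline recipe may differ; this only conveys the general idea of hashing without any learned codes.

```python
import numpy as np

rng = np.random.default_rng(0)

def binarize(x: np.ndarray, mean: np.ndarray, rotation: np.ndarray) -> np.ndarray:
    """Center, apply a fixed orthonormal rotation, and take signs as bits."""
    return ((x - mean) @ rotation > 0).astype(np.uint8)

def hamming_search(query_code: np.ndarray, db_codes: np.ndarray, top_k: int = 5):
    dists = np.count_nonzero(db_codes != query_code, axis=1)
    return np.argsort(dists)[:top_k]

# Placeholder "document embeddings" (d=384) hashed to 64 bits.
d, bits = 384, 64
rotation, _ = np.linalg.qr(rng.standard_normal((d, bits)))  # orthonormal columns
db = rng.standard_normal((1000, d))
mean = db.mean(axis=0, keepdims=True)
codes = binarize(db, mean, rotation)
query_code = binarize(db[:1], mean, rotation)  # query with the first document
print(hamming_search(query_code[0], codes))    # index 0 should rank first
```

Because nothing is trained, the quality of the codes rests entirely on the pretrained encoder, which is exactly the point the Hashing-Baseline title makes.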

Sources

DeMeVa at LeWiDi-2025: Modeling Perspectives with In-Context Learning and Label Distribution Learning

MTEB-NL and E5-NL: Embedding Benchmark and Models for Dutch

zELO: ELO-inspired Training Method for Rerankers and Embedding Models

Conan-Embedding-v2: Training an LLM from Scratch for Text Embeddings

Tokenization Strategies for Low-Resource Agglutinative Languages in Word2Vec: Case Study on Turkish and Finnish

Hashing-Baseline: Rethinking Hashing in the Age of Pretrained Models
