Advancements in Multilingual Information Retrieval and Large Language Models

The field of multilingual information retrieval and large language models is evolving rapidly, with an emphasis on performance and efficiency in real-world deployments. Recent work has centered on optimizing how pretraining data is allocated across languages, improving the quality and diversity of instruction fine-tuning datasets, and making models more versatile for multilingual applications. Methods for tuning language ratios and selecting high-quality training data have shown promising results, and research has highlighted the role of token overlap in multilingual models. The development of large-scale, high-quality parallel corpora for Indian languages has also contributed significantly to the field.

Noteworthy papers include "Efficient and Versatile Model for Multilingual Information Retrieval of Islamic Text", which achieves promising results across diverse retrieval scenarios; "Exploring Polyglot Harmony: On Multilingual Data Allocation for Large Language Models Pretraining", which introduces a novel framework for optimizing multilingual data allocation; and a method for improving the multilingual quality and diversity of instruction fine-tuning datasets, which demonstrates significant performance gains over vanilla baselines.

Sources

Efficient and Versatile Model for Multilingual Information Retrieval of Islamic Text: Development and Deployment in Real-World Scenarios

Exploring Polyglot Harmony: On Multilingual Data Allocation for Large Language Models Pretraining

A method for improving multilingual quality and diversity of instruction fine-tuning datasets

UPRPRC: Unified Pipeline for Reproducing Parallel Resources -- Corpus from the United Nations

False Friends Are Not Foes: Investigating Vocabulary Overlap in Multilingual Language Models

Charting a Decade of Computational Linguistics in Italy: The CLiC-it Corpus

Human-Annotated NER Dataset for the Kyrgyz Language

How Much of Your Data Can Suck? Thresholds for Domain Performance and Emergent Misalignment in LLMs

EnAnchored-X2X: English-Anchored Optimization for Many-to-Many Translation

Mahānāma: A Unique Testbed for Literary Entity Discovery and Linking

CorIL: Towards Enriching Indian Language to Indian Language Parallel Corpora and Machine Translation Systems

Less is More: The Effectiveness of Compact Typological Language Representations

Low-Resource English-Tigrinya MT: Leveraging Multilingual Models, Custom Tokenizers, and Clean Evaluation Benchmarks

Into the Void: Understanding Online Health Information in Low-Web Data Languages
