Advances in Multilingual Language Models and Evaluation Paradigms

The field of natural language processing is moving toward more sophisticated and inclusive models, with a focus on multilingual capabilities and robust evaluation paradigms. Recent work highlights the importance of linguistic diversity and of typological relationships between languages: models are being designed to handle low-resource languages and dialects, and new techniques aim to improve cross-lingual transfer and reduce language bias. Entropy-based language representations and morphology-aware subword construction are two innovative approaches that enhance linguistic fidelity and token efficiency. In parallel, novel evaluation frameworks and metrics are enabling more accurate assessments of model performance and generalization. Noteworthy papers include the introduction of Camlang, a constructed language for evaluating metalinguistic reasoning in large language models, and Entropy2Vec, a framework for deriving cross-lingual language representations. In addition, the Hunyuan-MT model has achieved state-of-the-art performance in multilingual translation, and the MERLIN framework has improved accuracy on low-resource languages through multi-stage curriculum alignment.
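To make the idea of entropy-based language representations concrete, here is a minimal toy sketch (not the actual Entropy2Vec method, whose features come from language-model entropies): it computes the Shannon entropy of the character distribution of a text sample per language and uses those values as crude language features. The sample corpora and language codes below are hypothetical placeholders.

```python
import math
from collections import Counter

def char_entropy(text: str) -> float:
    """Shannon entropy (in bits) of the character distribution of `text`."""
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Toy corpora standing in for per-language samples (hypothetical data).
samples = {
    "en": "the quick brown fox jumps over the lazy dog",
    "de": "der schnelle braune fuchs springt ueber den faulen hund",
}

# One entropy value per language; a realistic system would stack many such
# measurements (e.g. several n-gram orders or model perplexities) into a vector.
features = {lang: [char_entropy(text)] for lang, text in samples.items()}
```

Languages whose feature vectors lie close together under such a scheme would be treated as statistically similar, which is the intuition behind using entropy profiles as learnable cross-lingual representations.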

Sources

The Gold Medals in an Empty Room: Diagnosing Metalinguistic Reasoning in LLMs with Camlang

TMT: A Simple Way to Translate Topic Models Using Dictionaries

Expanding Foundational Language Capabilities in Open-Source LLMs through a Korean Case Study

What if I ask in alia lingua? Measuring Functional Similarity Across Languages

Sample-efficient Integration of New Modalities into Large Language Models

Memorization ≠ Understanding: Do Large Language Models Have the Ability of Scenario Cognition?

PLaMo 2 Technical Report

Masked Diffusion Language Models with Frequency-Informed Training

Entropy2Vec: Crosslingual Language Modeling Entropy as End-to-End Learnable Language Representations

Hunyuan-MT Technical Report

No Translation Needed: Forecasting Quality from Fertility and Metadata

The Token Tax: Systematic Bias in Multilingual Tokenization

Llama-GENBA-10B: A Trilingual Large Language Model for German, English and Bavarian

Language Bias in Information Retrieval: The Nature of the Beast and Mitigation Methods

mmBERT: A Modern Multilingual Encoder with Annealed Language Learning

Bilingual Word Level Language Identification for Omotic Languages

MERLIN: Multi-Stage Curriculum Alignment for Multilingual Encoder and LLM Fusion

MoVoC: Morphology-Aware Subword Construction for Geez Script Languages
