Advances in Multilingual Natural Language Processing

The field of natural language processing is moving toward broader support for low-resource languages, with several papers presenting new datasets, models, and techniques for improving performance in these settings. One key trend is the use of multilingual models, which have been shown to outperform monolingual models on a range of tasks. Another focus is the development of more efficient and effective methods for adapting large language models to new languages and tasks, including few-shot prompting for in-context learning in low-resource languages and new datasets for Vietnamese and Nepali. Noteworthy papers include:

  • VSMRC, which presents a new dataset for Vietnamese text segmentation and multiple-choice reading comprehension.
  • NepaliGPT, which introduces a generative language model for the Nepali language.
  • RELIC, which proposes a novel framework for enhancing reward model generalization for low-resource Indic languages.
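To make the few-shot prompting idea above concrete, here is a minimal sketch of how an in-context learning prompt is typically assembled from a handful of labeled examples. The function name, task, and Hausa sentences are illustrative assumptions, not taken from any of the papers listed below.

```python
def build_few_shot_prompt(examples, query, instruction):
    """Assemble a few-shot in-context learning prompt:
    an instruction, labeled demonstrations, then the unlabeled query."""
    parts = [instruction]
    for text, label in examples:
        parts.append(f"Text: {text}\nLabel: {label}")
    # The model is expected to continue the prompt with the missing label.
    parts.append(f"Text: {query}\nLabel:")
    return "\n\n".join(parts)


# Hypothetical Hausa sentiment demonstrations (illustrative only).
examples = [
    ("Ina jin dadi sosai.", "positive"),
    ("Wannan abu bai yi kyau ba.", "negative"),
]
prompt = build_few_shot_prompt(
    examples,
    query="Na gode da taimakon ku.",
    instruction="Classify the sentiment of each Hausa sentence as positive or negative.",
)
print(prompt)
```

The prompt string would then be sent to a language model, which completes the final `Label:` line; varying the number and choice of demonstrations is the main lever these adaptation studies compare.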

Sources

A Vietnamese Dataset for Text Segmentation and Multiple Choices Reading Comprehension

HausaNLP at SemEval-2025 Task 11: Advancing Hausa Text-based Emotion Detection

NepaliGPT: A Generative Language Model for the Nepali Language

Relic: Enhancing Reward Model Generalization for Low-Resource Indic Languages with Few-Shot Examples

Automatic Speech Recognition Biases in Newcastle English: an Error Analysis

Semantic Outlier Removal with Embedding Models and LLMs

Prompt, Translate, Fine-Tune, Re-Initialize, or Instruction-Tune? Adapting LLMs for In-Context Learning in Low-Resource Languages

MuBench: Assessment of Multilingual Capabilities of Large Language Models Across 61 Languages

Has Machine Translation Evaluation Achieved Human Parity? The Human Reference and the Limits of Progress

Arabic Dialect Classification using RNNs, Transformers, and Large Language Models: A Comparative Analysis

CycleDistill: Bootstrapping Machine Translation using LLMs with Cyclical Distillation

Semantic Caching for Improving Web Affordability

Multi-lingual Functional Evaluation for Large Language Models

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Text2Cypher Across Languages: Evaluating Foundational Models Beyond English
