Advances in Multilingual NLP
The field of natural language processing (NLP) is moving toward more inclusive and diverse language models, with a particular focus on improving performance in low-resource languages. Recent research highlights the importance of script-aware specialization, cross-lingual transfer, and activation alignment for multilingual NLP performance. Noteworthy papers in this area include the Arabic Script RoBERTa family, which captures language-specific script features and statistics, and a multi-agent interactive framework for generating long-context questions. Other notable works include a numerical scoring system for objectively measuring rhetoric in Arabic texts, the Modern Uyghur Dependency Treebank, and the Layout Error Detection benchmark for diagnosing structural errors in document layout analysis. These advances have the potential to improve NLP performance across a wide range of languages and applications.
Sources
The Role of Orthographic Consistency in Multilingual Embedding Models for Text Classification in Arabic-Script Languages
Towards Inclusive NLP: Assessing Compressed Multilingual Transformers across Diverse Language Benchmarks
Survey of NLU Benchmarks Diagnosing Linguistic Phenomena: Why not Standardize Diagnostics Benchmarks?
Creation of a Numerical Scoring System to Objectively Measure and Compare the Level of Rhetoric in Arabic Texts: A Feasibility Study, and A Working Prototype