Advances in Multilingual NLP and Language Modeling

Natural Language Processing (NLP) research is advancing rapidly, with a strong focus on multilingual capabilities and language modeling. Recent work on cross-linguistic transfer examines the role of language families and morphology, finding that language family proximity and morphological similarity can significantly affect model performance. There is also growing interest in evaluating and improving large language models (LLMs) in low-resource languages, where techniques such as synthetic data generation and multitask learning show promise. New benchmarks and evaluation frameworks are enabling more comprehensive assessment in multilingual settings: notable papers include MUG-Eval, a framework for evaluating LLMs' multilingual generation capabilities, and MAPS, a multilingual benchmark suite for evaluating agentic AI systems. Furthermore, research on word order change and language evolution has proposed a universal underlying mechanism based on word class length, and new methods for probing subphonemes in morphology models have improved our understanding of how transformers encode phonological features.

Sources

Reconstructing Syllable Sequences in Abugida Scripts with Incomplete Inputs

Probing Subphonemes in Morphology Models

A computational system to handle the orthographic layer of tajwid in contemporary Quranic Orthography

Cross-Linguistic Transfer in Multilingual NLP: The Role of Language Families and Morphology

Word length predicts word order: "Min-max"-ing drives language evolution

Probing BERT for German Compound Semantics

MUG-Eval: A Proxy Evaluation Framework for Multilingual Generation Capabilities in Any Language

Scaling Low-Resource MT via Synthetic Data Generation with LLMs

X-WebAgentBench: A Multilingual Interactive Web Benchmark for Evaluating Global Agentic System

MAPS: A Multilingual Benchmark for Global Agent Performance and Security

LLMs Are Not Scorers: Rethinking MT Evaluation with Generation-Based Methods

Learning Beyond Limits: Multitask Learning and Synthetic Data for Low-Resource Canonical Morpheme Segmentation

Does Synthetic Data Help Named Entity Recognition for Low-Resource Languages?
