Advances in Multilingual Natural Language Processing

The field of natural language processing is moving toward greater inclusivity and stronger support for low-resource languages. Researchers are developing parameter-efficient methods for fine-tuning large language models in these settings, notably QLoRA (quantized low-rank adaptation) and cross-lingual instruction tuning. There is also a growing focus on creating high-quality, culturally grounded datasets for multilingual NLP, with an emphasis on preserving cultural nuance and task diversity. Noteworthy papers include Fine-Tuning Large Language Models with QLoRA for Offensive Language Detection in Roman Urdu-English Code-Mixed Text, which demonstrates the efficacy of QLoRA for fine-tuning high-performing models in low-resource environments, and LuxInstruct: A Cross-Lingual Instruction Tuning Dataset For Luxembourgish, which highlights the benefits of cross-lingual data curation for low-resource language development.
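
As a rough illustration of the QLoRA recipe these papers build on, the sketch below loads a 4-bit quantized base model and attaches trainable low-rank adapters using the Hugging Face transformers and peft libraries. The base checkpoint (xlm-roberta-large), label count, and LoRA hyperparameters here are illustrative assumptions, not settings taken from the cited work.

```python
import torch
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    BitsAndBytesConfig,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization of the frozen base weights -- the "Q" in QLoRA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = "xlm-roberta-large"  # hypothetical choice; the paper's base model may differ
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForSequenceClassification.from_pretrained(
    base_model,
    num_labels=2,  # e.g. offensive vs. not offensive
    quantization_config=bnb_config,
)
model = prepare_model_for_kbit_training(model)

# LoRA adapters: only these small low-rank matrices are trained,
# while the 4-bit base weights stay frozen.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query", "value"],  # attention projections in RoBERTa-style models
    task_type="SEQ_CLS",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```

Because only the adapter weights receive gradients while the quantized base stays frozen, this setup trains on a single modest GPU, which is what makes the approach attractive for low-resource language work.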

Sources

Fine-Tuning Large Language Models with QLoRA for Offensive Language Detection in Roman Urdu-English Code-Mixed Text

Sri Lanka Document Datasets: A Large-Scale, Multilingual Resource for Law, News, and Policy (v20251005)

Fine Tuning Methods for Low-resource Languages

Scalable multilingual PII annotation for responsible AI in LLMs

Pragyaan: Designing and Curating High-Quality Cultural Post-Training Datasets for Indian Languages

LuxInstruct: A Cross-Lingual Instruction Tuning Dataset For Luxembourgish

Sunflower: A New Approach To Expanding Coverage of African Languages in Large Language Models
