Advances in Multilingual NLP and Fairness

Natural Language Processing (NLP) research is moving towards more inclusive and equitable models, with a focus on multilingualism and fairness. Recent work highlights the importance of accounting for cultural and linguistic nuance in NLP models, particularly for low-resource languages. New techniques target tokenization directly: Parity-Aware Byte-Pair Encoding aims to improve cross-lingual fairness in tokenization, while H-Net++ pursues tokenizer-free language modelling for morphologically rich languages. There is also a growing emphasis on addressing discrimination and bias in NLP, with a shift towards framing the problem as a systemic issue rather than a technological one.

Noteworthy papers include RooseBERT, a pre-trained language model for political discourse that is reported to improve significantly over general-purpose language models on downstream tasks, and CogBench, a benchmark for evaluating the cross-lingual and cross-site generalizability of large language models in speech-based cognitive impairment assessment, whose results underline the importance of linguistic and cultural factors in NLP models.

Sources

Cross-lingual Opinions and Emotions Mining in Comparable Documents

Analyzing German Parliamentary Speeches: A Machine Learning Approach for Topic and Sentiment Classification

Somatic in the East, Psychological in the West?: Investigating Clinically-Grounded Cross-Cultural Depression Symptom Expression in LLMs

RooseBERT: A New Deal For Political Language Modelling

CogBench: A Large Language Model Benchmark for Multilingual Speech-Based Cognitive Impairment Assessment

Marito: Structuring and Building Open Multilingual Terminologies for South African NLP

FairLangProc: A Python package for fairness in NLP

Health Insurance Coverage Rule Interpretation Corpus: Law, Policy, and Medical Guidance for Health Insurance Coverage Understanding

Reasoning Beyond Labels: Measuring LLM Sentiment in Low-Resource, Culturally Nuanced Contexts

Moving beyond harm. A critical review of how NLP research approaches discrimination

Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization

Do Political Opinions Transfer Between Western Languages? An Analysis of Unaligned and Aligned Multilingual LLMs

H-Net++: Hierarchical Dynamic Chunking for Tokenizer-Free Language Modelling in Morphologically-Rich Languages