Advances in Natural Language Processing and Multilingual Evaluations

The field of natural language processing is moving toward a more nuanced understanding of language, with a focus on evaluating and improving the performance of large language models (LLMs) in multilingual settings. Recent research highlights the need for more comprehensive and diverse evaluation benchmarks, as well as the importance of considering the social and cultural context of language use. New metrics and frameworks, such as the Single Token Retention Rate (STRR) and the LongQAEval framework, enable more accurate and equitable assessments of LLMs across languages and domains. Studies also stress the need to address the digital epistemic injustice faced by marginalized languages and to develop more inclusive, linguistically informed tokenization strategies.

Noteworthy papers include: NarraBench, which presents a comprehensive framework for narrative benchmarking and highlights the need for new evaluations covering overlooked aspects of narrative understanding; Invisible Languages of the LLM Universe, which proposes a critical framework for understanding linguistic inequality in AI systems and demonstrates the structural exclusion of marginalized languages; and Tokenization Disparities as Infrastructure Bias, which conducts a large-scale cross-linguistic evaluation of tokenization efficiency and reveals substantial disparities in computational costs and effective context utilization across languages.
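
To make the kind of per-language tokenization statistics these studies rely on concrete, the sketch below computes two simple indicators over a corpus: fertility (average subword tokens per whitespace-delimited word) and a single-token retention rate (fraction of words kept as exactly one token). This is only one plausible reading of an STRR-style metric, not the exact definition from the cited paper, and the `toy_tokenize` helper is purely hypothetical; in practice a real subword tokenizer (for example, a Hugging Face tokenizer's `tokenize` method) would be passed in.

```python
from typing import Callable, Iterable, List


def tokenization_stats(
    tokenize: Callable[[str], List[str]],
    corpus: Iterable[str],
) -> dict:
    """Compute rough tokenization-efficiency indicators over a corpus.

    fertility          -- average subword tokens per whitespace-delimited word
                          (higher fertility means more compute and context
                          consumed per word in that language)
    single_token_rate  -- fraction of words mapped to exactly one token
                          (one plausible reading of a single-token
                          retention-rate-style metric; assumption, not the
                          paper's exact definition)
    """
    total_words = 0
    total_tokens = 0
    single_token_words = 0
    for line in corpus:
        for word in line.split():
            pieces = tokenize(word)
            total_words += 1
            total_tokens += len(pieces)
            if len(pieces) == 1:
                single_token_words += 1
    if total_words == 0:
        return {"fertility": 0.0, "single_token_rate": 0.0}
    return {
        "fertility": total_tokens / total_words,
        "single_token_rate": single_token_words / total_words,
    }


if __name__ == "__main__":
    # Hypothetical stand-in for a real subword tokenizer: splits words longer
    # than 5 characters into 5-character chunks.
    def toy_tokenize(word: str) -> List[str]:
        return [word[i:i + 5] for i in range(0, len(word), 5)] or [word]

    sample = ["tokenization disparities raise per-language inference costs"]
    print(tokenization_stats(toy_tokenize, sample))
```

Comparing these numbers for the same tokenizer across parallel corpora in different languages is one simple way to surface the cross-linguistic disparities the overview describes.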

Sources

NarraBench: A Comprehensive Framework for Narrative Benchmarking

Beyond Fertility: Analyzing STRR as a Metric for Multilingual Tokenization Evaluation

LongQAEval: Designing Reliable Evaluations of Long-Form Clinical QA under Resource Constraints

Invisible Languages of the LLM Universe

Tokenization Disparities as Infrastructure Bias: How Subword Systems Create Inequities in LLM Access and Efficiency

Diff-XYZ: A Benchmark for Evaluating Diff Understanding

PET Head Motion Estimation Using Supervised Deep Learning with Attention

Benchmarking Open-Source Large Language Models for Persian in Zero-Shot and Few-Shot Learning

A Critical Review of the Need for Knowledge-Centric Evaluation of Quranic Recitation

Evaluating Arabic Large Language Models: A Survey of Benchmarks, Methods, and Gaps

Tahakom LLM guidelines and receipts: from pre-training data to an Arabic LLM

LiteraryQA: Towards Effective Evaluation of Long-document Narrative QA

OmniGaze: Reward-inspired Generalizable Gaze Estimation In The Wild

Benchmarking Multimodal Large Language Models for Face Recognition

TokDrift: When LLM Speaks in Subwords but Code Speaks in Grammar
