Advances in Multilingual and Cultural Evaluation of Large Language Models

The field of natural language processing is moving toward more inclusive and diverse evaluation of large language models (LLMs), with a growing focus on low-resource languages and cultural contexts. Recent studies have highlighted the limitations of current LLMs in understanding and generating text in non-English languages and have introduced new benchmarks and datasets to close these gaps. These efforts aim to promote more accurate and culturally sensitive representations of diverse languages and cultures. Notable papers in this area include MELAC, which introduced a comprehensive evaluation dataset for the Persian language and Iranian culture, and FilBench, a Filipino-centric benchmark for evaluating LLMs. MyCulture and MELLA proposed new approaches to evaluating LLMs on Malaysian culture and on low-resource languages, respectively, while TASE introduced a comprehensive benchmark for token-level understanding and structural reasoning across languages. The TUB Sign Language Corpus Collection extends this push for inclusivity beyond spoken and written languages to sign language data.
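To make the evaluation setting concrete, the sketch below shows how a multiple-choice cultural or multilingual benchmark of this kind is typically scored, reporting per-language accuracy. This is a minimal illustration, not the harness of any paper above; the item fields ("language", "answer") and the sample data are assumptions for the example.

```python
from collections import defaultdict

def score_benchmark(predictions, examples):
    """Compute per-language accuracy for a multiple-choice benchmark.

    predictions: list of model answer letters, e.g. ["B", "C", ...]
    examples: list of dicts with (assumed) keys "language" and "answer"
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, ex in zip(predictions, examples):
        lang = ex["language"]
        total[lang] += 1
        # Normalize before comparing so "b " still matches "B"
        if pred.strip().upper() == ex["answer"].strip().upper():
            correct[lang] += 1
    return {lang: correct[lang] / total[lang] for lang in total}

# Hypothetical items in the style of a cultural benchmark
examples = [
    {"language": "fil", "question": "...", "answer": "B"},
    {"language": "fa",  "question": "...", "answer": "A"},
]
predictions = ["B", "C"]  # model outputs, one per item

print(score_benchmark(predictions, examples))  # {'fil': 1.0, 'fa': 0.0}
```

Breaking accuracy out by language rather than averaging over the whole pool is what surfaces the English vs. low-resource performance gaps these benchmarks are designed to expose.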

Sources

MELAC: Massive Evaluation of Large Language Models with Alignment of Culture in Persian Language

FilBench: Can LLMs Understand and Generate Filipino?

The TUB Sign Language Corpus Collection

MyCulture: Exploring Malaysia's Diverse Culture under Low-Resource Language Constraints

TASE: Token Awareness and Structured Evaluation for Multilingual Language Models

MELLA: Bridging Linguistic Capability and Cultural Groundedness for Low-Resource Language MLLMs
