The field of large language models (LLMs) is moving toward more culturally and linguistically inclusive development, with a focus on evaluating and improving instruction-following capabilities, social reasoning, and creative storytelling. Researchers are introducing new benchmarks and datasets to assess LLM performance across diverse languages and tasks, such as the Korean Instruction-following Task Evaluation (KITE) and the SCRIPTS dataset for social reasoning in English and Korean dialogues. These efforts aim to address the limitations and biases of current LLMs and to inspire further research in this area. Notable papers include:
- KITE (Korean Instruction-following Task Evaluation), a comprehensive benchmark for assessing and improving the Korean instruction-following abilities of LLMs.
- Qomhra, a bilingual Irish-English LLM that achieves significant gains on both Irish and English evaluations and, after instruction tuning, shows clear progress in instruction following.