Stability and Evaluation of Large Language Models

Research on large language models (LLMs) is placing growing emphasis on the stability and reliability of evaluations. Recent work targets the volatility of evaluation scores and the fairness of model comparisons, proposing new assessment methods such as instance-level randomization and multi-to-one interview paradigms that aim to make performance estimates more robust and efficient. There is also broader recognition that evaluation design itself matters, including the need for standardized, transparent protocols. Noteworthy papers in this area include "Instance-level Randomization: Toward More Stable LLM Evaluations", which proposes a randomization scheme to reduce score variance and make model comparisons fairer, and "Mind the Gap: A Closer Look at Tokenization for Multiple-Choice Question Answering with LLMs", which shows how tokenization choices affect multiple-choice evaluation and recommends a tokenization strategy that improves accuracy and calibration.
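To make the randomization idea concrete, here is a minimal sketch of one way instance-level randomization could work, assuming it amounts to randomizing per-instance presentation factors (here, only answer-option order) independently on each run and averaging accuracy across runs; the actual method in the paper may randomize different factors. The function names, the `model_answer_fn` callback, and the toy data are illustrative, not taken from the paper.

```python
import random
import statistics


def evaluate_once(model_answer_fn, dataset, seed):
    """Score one evaluation run in which a per-instance factor
    (here: answer-option order) is randomized independently."""
    rng = random.Random(seed)
    correct = 0
    for item in dataset:
        # Shuffle the options for this instance only, and track where
        # the gold answer lands after shuffling.
        options = list(item["options"])
        rng.shuffle(options)
        gold_index = options.index(item["answer"])
        predicted_index = model_answer_fn(item["question"], options)
        correct += int(predicted_index == gold_index)
    return correct / len(dataset)


def instance_level_randomized_score(model_answer_fn, dataset, n_runs=5):
    """Average accuracy over several independently randomized runs and
    report the run-to-run spread as a stability indicator."""
    scores = [evaluate_once(model_answer_fn, dataset, seed) for seed in range(n_runs)]
    return statistics.mean(scores), statistics.pstdev(scores)


if __name__ == "__main__":
    # Toy dataset and a dummy "model" that always picks option 0,
    # purely to make the sketch runnable end to end.
    toy_data = [
        {"question": "2 + 2 = ?", "options": ["3", "4", "5", "6"], "answer": "4"},
        {"question": "Capital of France?", "options": ["Paris", "Rome", "Berlin", "Madrid"], "answer": "Paris"},
    ]
    dummy_model = lambda question, options: 0
    mean_acc, spread = instance_level_randomized_score(dummy_model, toy_data)
    print(f"mean accuracy={mean_acc:.2f}, run-to-run spread={spread:.2f}")
```

Averaging over several randomized replicates in this way trades a constant factor of extra compute for an estimate whose variance is lower and whose sensitivity to any single fixed presentation (e.g., a particular option ordering) is reduced.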

Sources

The Non-Determinism of Small LLMs: Evidence of Low Answer Consistency in Repetition Trials of Standard Multiple-Choice Benchmarks

Instance-level Randomization: Toward More Stable LLM Evaluations

A Multi-To-One Interview Paradigm for Efficient MLLM Evaluation

Mind the Gap: A Closer Look at Tokenization for Multiple-Choice Question Answering with LLMs
