Evaluating and Improving Large Language Models

The field of large language models (LLMs) is advancing rapidly, with particular attention to better evaluation methods and to safety. Researchers are developing frameworks for efficiently estimating the relative capabilities of LLMs, such as STEM, which identifies significant transition samples to estimate model capabilities at low cost. There is also growing emphasis on ensuring the safety of LLMs across diverse linguistic and cultural contexts, reflected in comprehensive multilingual safety benchmarks such as LinguaSafe. Other work investigates the limitations of in-context learning in LLMs and proposes alternative scoring functions, such as scaled signed averaging (SSA), to improve performance. Noteworthy papers include:

  • STEM, a lightweight and interpretable evaluation framework for efficiently estimating the relative capabilities of LLMs from structured transition samples (a hypothetical sketch of the transition-sample idea follows this list).
  • LinguaSafe, a comprehensive multilingual safety benchmark that addresses the critical need for multilingual safety evaluations of LLMs.
  • Signal and Noise, a framework for reducing uncertainty in language model evaluation by analyzing the properties that make a benchmark reliable (see the signal-to-noise sketch below).
  • Compressed Models are NOT Trust-equivalent to Their Large Counterparts, which highlights the importance of careful assessment when deploying compressed models.
  • Improving in-context learning with a better scoring function, which proposes SSA to improve performance on tasks involving first-order quantifiers and linear functions (see the SSA sketch below).
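
This digest does not reproduce STEM's algorithm; the following is a minimal hypothetical sketch, assuming a "transition sample" is one whose pass/fail pattern flips exactly once when reference models are ordered from weakest to strongest, and that a new model is placed at whichever rank best matches those flip points. All names and the rank-fitting rule are illustrative assumptions, not the paper's method.

```python
import numpy as np

def find_transition_samples(results: np.ndarray) -> np.ndarray:
    """results: (n_samples, n_models) pass/fail matrix with reference
    models ordered weakest -> strongest. A 'transition sample' (assumed
    definition) flips fail -> pass exactly once along that ordering."""
    flips = np.diff(results.astype(int), axis=1)
    one_up = (flips == 1).sum(axis=1) == 1
    no_down = (flips == -1).sum(axis=1) == 0
    return np.where(one_up & no_down)[0]

def estimate_relative_rank(results, new_passes, transition_idx):
    """Place a new model at the rank (0 = weaker than all references,
    n_models = stronger than all) that minimizes disagreement with the
    fail...fail | pass...pass pattern of each transition sample."""
    n_models = results.shape[1]
    errors = []
    for rank in range(n_models + 1):
        err = 0
        for i in transition_idx:
            flip_at = int(np.argmax(np.diff(results[i].astype(int)) == 1)) + 1
            err += (rank >= flip_at) != bool(new_passes[i])
        errors.append(err)
    return int(np.argmin(errors))

# Toy example: 3 reference models (weak, mid, strong), 4 samples.
results = np.array([[0, 1, 1],   # transition after the weakest model
                    [0, 0, 1],   # transition after the middle model
                    [1, 1, 1],   # always solved: carries no ranking signal
                    [0, 0, 0]])  # never solved: likewise uninformative
new_passes = np.array([1, 0, 1, 0])  # new model solves sample 0 but not 1
idx = find_transition_samples(results)
print(idx, estimate_relative_rank(results, new_passes, idx))  # -> [0 1] 1
```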
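
As a rough illustration of the Signal and Noise idea (the paper's precise estimators may differ), a benchmark is more decision-useful when the spread of scores across models (the signal) is large relative to the checkpoint-to-checkpoint jitter of a single model's score (the noise):

```python
import numpy as np

def signal_to_noise(final_scores, checkpoint_scores):
    """final_scores: one benchmark score per candidate model.
    checkpoint_scores: one model's scores over its last k checkpoints.
    Simple proxy: spread across models divided by within-run jitter."""
    signal = np.max(final_scores) - np.min(final_scores)
    noise = np.std(checkpoint_scores)
    return signal / noise

models = np.array([0.41, 0.47, 0.52, 0.60])           # final scores, 4 models
checkpoints = np.array([0.515, 0.522, 0.518, 0.520])  # one model, last 4 ckpts
print(f"SNR = {signal_to_noise(models, checkpoints):.1f}")
```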
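
The exact form of SSA is not quoted here; as a hedged sketch, the snippet below contrasts standard softmax weighting with an assumed sign-preserving alternative that scales each similarity by the total absolute similarity. Such weights can be negative, which is one plausible route to expressing the linear functions and quantifier-style aggregation that softmax scoring struggles with; treat the formula as an assumption, not necessarily the paper's definition.

```python
import numpy as np

def softmax_weights(sim: np.ndarray) -> np.ndarray:
    """Standard scoring: strictly positive weights that sum to 1."""
    e = np.exp(sim - sim.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def ssa_weights(sim: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Assumed SSA-style scoring: keep each similarity's sign and scale
    by the total absolute similarity, so weights may be negative."""
    return sim / (np.abs(sim).sum(axis=-1, keepdims=True) + eps)

sim = np.array([2.0, -1.0, 0.5])  # query-key similarities
print(softmax_weights(sim))       # all positive, approx [0.79 0.04 0.18]
print(ssa_weights(sim))           # sign-preserving, approx [0.57 -0.29 0.14]
```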

Sources

STEM: Efficient Relative Capability Evaluation of LLMs through Structured Transition Samples

LinguaSafe: A Comprehensive Multilingual Safety Benchmark for Large Language Models

Signal and Noise: A Framework for Reducing Uncertainty in Language Model Evaluation

Compressed Models are NOT Trust-equivalent to Their Large Counterparts

Explainable Graph Spectral Clustering For Text Embeddings

Improving in-context learning with a better scoring function

A Survey on Large Language Model Benchmarks
