Evaluating and Improving Large Language Models

The field of large language models (LLMs) is advancing rapidly, with particular attention to better evaluation methods and to safety. Researchers are developing frameworks for efficiently estimating the relative capabilities of LLMs, such as STEM, which identifies significant transition samples to estimate model capabilities at low cost. There is also growing emphasis on ensuring the safety of LLMs across diverse linguistic and cultural contexts, reflected in comprehensive multilingual safety benchmarks such as LinguaSafe. Other work investigates the limitations of in-context learning in LLMs and proposes alternative scoring functions, such as scaled signed averaging (SSA), to improve performance. Noteworthy papers include:

  • STEM, a lightweight and interpretable evaluation framework for efficiently estimating the relative capabilities of LLMs from structured transition samples (a hypothetical sketch of the transition-sample idea follows this list).
  • LinguaSafe, a comprehensive multilingual safety benchmark that addresses the critical need for multilingual safety evaluations of LLMs.
  • Signal and Noise, a framework for reducing uncertainty in language model evaluation by analyzing the properties that make a benchmark reliable (see the signal-to-noise sketch below).
  • Compressed Models are NOT Trust-equivalent to Their Large Counterparts, which highlights the importance of careful assessment when deploying compressed models.
  • Improving in-context learning with a better scoring function, which proposes SSA to improve performance on tasks involving first-order quantifiers and linear functions (see the SSA sketch below).
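
This digest does not reproduce STEM's algorithm; the following is a minimal hypothetical sketch, assuming a "transition sample" is one whose pass/fail pattern flips exactly once when reference models are ordered from weakest to strongest, and that a new model is placed at whichever rank best matches those flip points. All names and the rank-fitting rule are illustrative assumptions, not the paper's method.

```python
import numpy as np

def find_transition_samples(results: np.ndarray) -> np.ndarray:
    """results: (n_samples, n_models) pass/fail matrix with reference
    models ordered weakest -> strongest. A 'transition sample' (assumed
    definition) flips fail -> pass exactly once along that ordering."""
    flips = np.diff(results.astype(int), axis=1)
    one_up = (flips == 1).sum(axis=1) == 1
    no_down = (flips == -1).sum(axis=1) == 0
    return np.where(one_up & no_down)[0]

def estimate_relative_rank(results, new_passes, transition_idx):
    """Place a new model at the rank (0 = weaker than all references,
    n_models = stronger than all) that minimizes disagreement with the
    fail...fail | pass...pass pattern of each transition sample."""
    n_models = results.shape[1]
    errors = []
    for rank in range(n_models + 1):
        err = 0
        for i in transition_idx:
            flip_at = int(np.argmax(np.diff(results[i].astype(int)) == 1)) + 1
            err += (rank >= flip_at) != bool(new_passes[i])
        errors.append(err)
    return int(np.argmin(errors))

# Toy example: 3 reference models (weak, mid, strong), 4 samples.
results = np.array([[0, 1, 1],   # transition after the weakest model
                    [0, 0, 1],   # transition after the middle model
                    [1, 1, 1],   # always solved: carries no ranking signal
                    [0, 0, 0]])  # never solved: likewise uninformative
new_passes = np.array([1, 0, 1, 0])  # new model solves sample 0 but not 1
idx = find_transition_samples(results)
print(idx, estimate_relative_rank(results, new_passes, idx))  # -> [0 1] 1
```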
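
As a rough illustration of the Signal and Noise idea (the paper's precise estimators may differ), a benchmark is more decision-useful when the spread of scores across models (the signal) is large relative to the checkpoint-to-checkpoint jitter of a single model's score (the noise):

```python
import numpy as np

def signal_to_noise(final_scores, checkpoint_scores):
    """final_scores: one benchmark score per candidate model.
    checkpoint_scores: one model's scores over its last k checkpoints.
    Simple proxy: spread across models divided by within-run jitter."""
    signal = np.max(final_scores) - np.min(final_scores)
    noise = np.std(checkpoint_scores)
    return signal / noise

models = np.array([0.41, 0.47, 0.52, 0.60])           # final scores, 4 models
checkpoints = np.array([0.515, 0.522, 0.518, 0.520])  # one model, last 4 ckpts
print(f"SNR = {signal_to_noise(models, checkpoints):.1f}")
```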
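
The exact form of SSA is not quoted here; as a hedged sketch, the snippet below contrasts standard softmax weighting with an assumed sign-preserving alternative that scales each similarity by the total absolute similarity. Such weights can be negative, which is one plausible route to expressing the linear functions and quantifier-style aggregation that softmax scoring struggles with; treat the formula as an assumption, not necessarily the paper's definition.

```python
import numpy as np

def softmax_weights(sim: np.ndarray) -> np.ndarray:
    """Standard scoring: strictly positive weights that sum to 1."""
    e = np.exp(sim - sim.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def ssa_weights(sim: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Assumed SSA-style scoring: keep each similarity's sign and scale
    by the total absolute similarity, so weights may be negative."""
    return sim / (np.abs(sim).sum(axis=-1, keepdims=True) + eps)

sim = np.array([2.0, -1.0, 0.5])  # query-key similarities
print(softmax_weights(sim))       # all positive, approx [0.79 0.04 0.18]
print(ssa_weights(sim))           # sign-preserving, approx [0.57 -0.29 0.14]
```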

Sources

STEM: Efficient Relative Capability Evaluation of LLMs through Structured Transition Samples

LinguaSafe: A Comprehensive Multilingual Safety Benchmark for Large Language Models

Signal and Noise: A Framework for Reducing Uncertainty in Language Model Evaluation

Compressed Models are NOT Trust-equivalent to Their Large Counterparts

Explainable Graph Spectral Clustering For Text Embeddings

Improving in-context learning with a better scoring function

A Survey on Large Language Model Benchmarks
