Evaluating Large Language Models
The field of large language models (LLMs) is evolving rapidly, with growing attention to how these models are evaluated. Recent research stresses the need for more robust and reliable evaluation frameworks, since current methods can be flawed or biased. A key challenge is overestimation of performance caused by benchmark contamination, which leads to unfair comparisons between models. To address this, researchers are exploring new approaches, such as dynamic evaluation frameworks and benchmark-free paradigms, that aim to provide more accurate and transparent assessments of LLM performance. Another area of focus is the development of broader and more diverse benchmarks that test LLM capabilities across a wide range of tasks and domains. Notable work includes ArxivRoll, a dynamic evaluation framework that constructs a fresh benchmark every six months from recent ArXiv articles, and LLM-Crowdsourced, a benchmark-free paradigm in which LLMs generate questions, answer them independently, and evaluate one another's responses. These approaches are advancing the field and offering new insights into the strengths and limitations of LLMs.
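To make the benchmark-free idea concrete, below is a minimal sketch of a mutual-evaluation loop in the spirit of LLM-Crowdsourced: each model proposes questions, all models answer independently, and each model grades the others' answers. This is not the paper's implementation; `query_model`, the prompts, and the 0-10 scoring scheme are illustrative assumptions that would need to be replaced with real model API calls and the authors' actual protocol.

```python
# Sketch of a benchmark-free, mutual-evaluation loop (illustrative only):
# models generate questions, every model answers every question
# independently, and each model then scores the other models' answers.

from itertools import product
from statistics import mean


def query_model(model: str, prompt: str) -> str:
    """Hypothetical stand-in: replace with a real call to the model's API."""
    return f"[{model}] response to: {prompt[:40]}..."


def crowdsourced_eval(models: list[str], questions_per_model: int = 2) -> dict[str, float]:
    # 1. Each model proposes its own questions (no fixed benchmark).
    questions = [
        query_model(m, f"Propose challenging question #{i + 1} in your strongest domain.")
        for m in models
        for i in range(questions_per_model)
    ]

    # 2. Every model answers every question independently.
    answers = {
        (m, q): query_model(m, f"Answer the following question:\n{q}")
        for m, q in product(models, questions)
    }

    # 3. Models evaluate one another's answers (self-grading is skipped);
    #    the grade is naively parsed as the first digits in the judge's reply.
    scores: dict[str, list[float]] = {m: [] for m in models}
    for judge in models:
        for (answerer, q), a in answers.items():
            if judge == answerer:
                continue
            verdict = query_model(
                judge, f"Score this answer from 0 to 10.\nQuestion: {q}\nAnswer: {a}\nScore:"
            )
            digits = "".join(ch for ch in verdict if ch.isdigit())
            if digits:
                scores[answerer].append(min(float(digits[:2]), 10.0))

    # 4. Aggregate each model's received scores into a single number.
    return {m: mean(s) if s else 0.0 for m, s in scores.items()}


if __name__ == "__main__":
    print(crowdsourced_eval(["model-a", "model-b"]))
```

In practice, the ranking would be sensitive to how judge scores are parsed and aggregated; the naive averaging here is only meant to show the overall question-answer-judge structure.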
Sources
How Much Do Large Language Models Cheat on Evaluation? Benchmarking Overestimation under the One-Time-Pad-Based Framework
Exploring LLM Autoscoring Reliability in Large-Scale Writing Assessments Using Generalizability Theory
CompoST: A Benchmark for Analyzing the Ability of LLMs To Compositionally Interpret Questions in a QALD Setting