Evaluating Large Language Models
The field of large language models (LLMs) is evolving rapidly, with growing attention to how these models are evaluated. Recent research stresses the need for more robust and reliable evaluation frameworks, since current methods can be flawed or biased. A key challenge is overestimation of performance caused by benchmark contamination, which leads to unfair comparisons between models. To address this, researchers are exploring new approaches, such as dynamic evaluation frameworks and benchmark-free paradigms, that aim to provide more accurate and transparent assessments of LLM performance. Another area of focus is the development of broader and more diverse benchmarks that test LLM capabilities across a wide range of tasks and domains. Notable work includes ArxivRoll, a dynamic evaluation framework that constructs a fresh benchmark every six months from recent ArXiv articles, and LLM-Crowdsourced, a benchmark-free paradigm in which LLMs generate questions, answer them independently, and evaluate one another's responses. These approaches are advancing the field and offering new insights into the strengths and limitations of LLMs.
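To make the benchmark-free idea concrete, below is a minimal sketch of a mutual-evaluation loop in the spirit of LLM-Crowdsourced: each model proposes questions, all models answer independently, and each model grades the others' answers. This is not the paper's implementation; `query_model`, the prompts, and the 0-10 scoring scheme are illustrative assumptions that would need to be replaced with real model API calls and the authors' actual protocol.

```python
# Sketch of a benchmark-free, mutual-evaluation loop (illustrative only):
# models generate questions, every model answers every question
# independently, and each model then scores the other models' answers.

from itertools import product
from statistics import mean


def query_model(model: str, prompt: str) -> str:
    """Hypothetical stand-in: replace with a real call to the model's API."""
    return f"[{model}] response to: {prompt[:40]}..."


def crowdsourced_eval(models: list[str], questions_per_model: int = 2) -> dict[str, float]:
    # 1. Each model proposes its own questions (no fixed benchmark).
    questions = [
        query_model(m, f"Propose challenging question #{i + 1} in your strongest domain.")
        for m in models
        for i in range(questions_per_model)
    ]

    # 2. Every model answers every question independently.
    answers = {
        (m, q): query_model(m, f"Answer the following question:\n{q}")
        for m, q in product(models, questions)
    }

    # 3. Models evaluate one another's answers (self-grading is skipped);
    #    the grade is naively parsed as the first digits in the judge's reply.
    scores: dict[str, list[float]] = {m: [] for m in models}
    for judge in models:
        for (answerer, q), a in answers.items():
            if judge == answerer:
                continue
            verdict = query_model(
                judge, f"Score this answer from 0 to 10.\nQuestion: {q}\nAnswer: {a}\nScore:"
            )
            digits = "".join(ch for ch in verdict if ch.isdigit())
            if digits:
                scores[answerer].append(min(float(digits[:2]), 10.0))

    # 4. Aggregate each model's received scores into a single number.
    return {m: mean(s) if s else 0.0 for m, s in scores.items()}


if __name__ == "__main__":
    print(crowdsourced_eval(["model-a", "model-b"]))
```

In practice, the ranking would be sensitive to how judge scores are parsed and aggregated; the naive averaging here is only meant to show the overall question-answer-judge structure.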
Sources
How Much Do Large Language Models Cheat on Evaluation? Benchmarking Overestimation under the One-Time-Pad-Based Framework
Exploring LLM Autoscoring Reliability in Large-Scale Writing Assessments Using Generalizability Theory
CompoST: A Benchmark for Analyzing the Ability of LLMs To Compositionally Interpret Questions in a QALD Setting