Advances in Large Language Model Evaluation and Calibration

The field of large language models (LLMs) is advancing rapidly, with a growing focus on improving evaluation and calibration methods. Recent research highlights the importance of developing more accurate and reliable methods for assessing LLM performance, particularly in high-stakes applications. One key area of development is the use of geometric properties of internal model representations to evaluate generated text quality, which has shown promise as a reference-free approach. There is also growing interest in abstention models, which recognize when an LLM is unsure or lacks the knowledge to answer a question and can either seek external help or abstain from responding.

Noteworthy papers in this area include:

Evidence for Limited Metacognition in LLMs introduces a novel methodology for quantitatively evaluating metacognitive abilities in LLMs.

Detecting (Un)answerability in Large Language Models with Linear Directions identifies a single direction in the model's activation space that captures unanswerability and uses it for classification (a minimal probe sketch follows this list).

Bridging the Knowledge-Prediction Gap in LLMs on Multiple-Choice Questions introduces a parameter-free intervention that transforms hidden states so that the prediction coordinate aligns with the knowledge coordinate within a subspace.

Generalized Correctness Models proposes several methods for injecting historical correctness information into a correctness model, yielding a Generalized Correctness Model that can be trained on correctness data from many LLMs.

From Internal Representations to Text Quality demonstrates that geometric properties of internal model representations serve as reliable proxies for the quality of generated text.

Pay-Per-Search Models are Abstention Models introduces a training framework that extracts abstentions from LLMs using reinforcement learning with a pay-per-search reward.

A-VERT presents a structure-free evaluation method that uses semantic embedding distances to match target candidates against arbitrary LM-generated text (see the second sketch below).

CLUE explores hidden states directly as a unified foundation for verification and presents a deliberately minimalist, non-parametric verifier.
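
The linear-direction idea above can be pictured with a small probe: estimate one direction in activation space from labeled answerable and unanswerable examples, then classify new questions by projecting their hidden states onto it. This is a minimal sketch under that assumption; the mean-difference estimator, function names, and threshold are illustrative and not taken from the paper.

```python
import numpy as np

def unanswerability_direction(h_unanswerable: np.ndarray, h_answerable: np.ndarray) -> np.ndarray:
    """Estimate a unit direction separating the two classes of hidden states (n_examples x d)."""
    direction = h_unanswerable.mean(axis=0) - h_answerable.mean(axis=0)
    return direction / np.linalg.norm(direction)

def is_unanswerable(hidden_state: np.ndarray, direction: np.ndarray, threshold: float = 0.0) -> bool:
    """Classify by projecting a question's hidden state onto the learned direction."""
    return float(hidden_state @ direction) > threshold

# Toy usage: random vectors stand in for a chosen layer's activations.
rng = np.random.default_rng(0)
dim = 16
h_ans = rng.normal(0.0, 1.0, size=(50, dim))
h_unans = rng.normal(0.5, 1.0, size=(50, dim))
w = unanswerability_direction(h_unans, h_ans)
print(is_unanswerable(h_unans[0], w))
```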

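For structure-free answer matching in the spirit of A-VERT, one minimal sketch is to embed the model's free-form output and each candidate target with a sentence-embedding model and select the candidate with the highest cosine similarity. The sentence-transformers model name and the scoring details below are assumptions for illustration, not the paper's exact pipeline.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed embedding backend

def match_candidate(generated: str, candidates: list[str], model: SentenceTransformer) -> tuple[str, float]:
    """Return the candidate whose embedding is closest (cosine) to the generated text."""
    vectors = np.asarray(model.encode([generated] + candidates))
    gen_vec, cand_vecs = vectors[0], vectors[1:]
    sims = cand_vecs @ gen_vec / (
        np.linalg.norm(cand_vecs, axis=1) * np.linalg.norm(gen_vec) + 1e-12
    )
    best = int(np.argmax(sims))
    return candidates[best], float(sims[best])

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
answer, score = match_candidate("The capital of France is Paris.", ["Paris", "Lyon", "Marseille"], model)
print(answer, round(score, 3))
```
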
Sources

Evidence for Limited Metacognition in LLMs

Detecting (Un)answerability in Large Language Models with Linear Directions

Bridging the Knowledge-Prediction Gap in LLMs on Multiple-Choice Questions

Reference-Free Rating of LLM Responses via Latent Information

Generalized Correctness Models: Learning Calibrated and Model-Agnostic Correctness Predictors from Historical Patterns

From Internal Representations to Text Quality: A Geometric Approach to LLM Evaluation

Pay-Per-Search Models are Abstention Models

A-VERT: Agnostic Verification with Embedding Ranking Targets

CLUE: Non-parametric Verification from Experience via Hidden-State Clustering
