Advances in Large Language Model Evaluation and Calibration

The field of large language models (LLMs) is advancing rapidly, with a growing focus on improving evaluation and calibration methods. Recent research highlights the importance of developing more accurate and reliable methods for assessing LLM performance, particularly in high-stakes applications. One key area of development is the use of geometric properties of internal model representations to evaluate generated text quality, which has shown promise as a reference-free approach. There is also growing interest in abstention models, which recognize when an LLM is unsure or lacks the knowledge to answer a question and can either seek external help or abstain from responding.

Noteworthy papers in this area include:

Evidence for Limited Metacognition in LLMs introduces a novel methodology for quantitatively evaluating metacognitive abilities in LLMs.

Detecting (Un)answerability in Large Language Models with Linear Directions identifies a single direction in the model's activation space that captures unanswerability and uses it for classification (a minimal probe sketch follows this list).

Bridging the Knowledge-Prediction Gap in LLMs on Multiple-Choice Questions introduces a parameter-free intervention that transforms hidden states so that the prediction coordinate aligns with the knowledge coordinate within a subspace.

Generalized Correctness Models proposes several methods for injecting historical correctness information into a correctness model, yielding a Generalized Correctness Model that can be trained on correctness data from many LLMs.

From Internal Representations to Text Quality demonstrates that geometric properties of internal model representations serve as reliable proxies for the quality of generated text.

Pay-Per-Search Models are Abstention Models introduces a training framework that extracts abstentions from LLMs using reinforcement learning with a pay-per-search reward.

A-VERT presents a structure-free evaluation method that uses semantic embedding distances to match target candidates against arbitrary LM-generated text (see the second sketch below).

CLUE explores hidden states directly as a unified foundation for verification and presents a deliberately minimalist, non-parametric verifier.
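
The linear-direction idea above can be pictured with a small probe: estimate one direction in activation space from labeled answerable and unanswerable examples, then classify new questions by projecting their hidden states onto it. This is a minimal sketch under that assumption; the mean-difference estimator, function names, and threshold are illustrative and not taken from the paper.

```python
import numpy as np

def unanswerability_direction(h_unanswerable: np.ndarray, h_answerable: np.ndarray) -> np.ndarray:
    """Estimate a unit direction separating the two classes of hidden states (n_examples x d)."""
    direction = h_unanswerable.mean(axis=0) - h_answerable.mean(axis=0)
    return direction / np.linalg.norm(direction)

def is_unanswerable(hidden_state: np.ndarray, direction: np.ndarray, threshold: float = 0.0) -> bool:
    """Classify by projecting a question's hidden state onto the learned direction."""
    return float(hidden_state @ direction) > threshold

# Toy usage: random vectors stand in for a chosen layer's activations.
rng = np.random.default_rng(0)
dim = 16
h_ans = rng.normal(0.0, 1.0, size=(50, dim))
h_unans = rng.normal(0.5, 1.0, size=(50, dim))
w = unanswerability_direction(h_unans, h_ans)
print(is_unanswerable(h_unans[0], w))
```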

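For structure-free answer matching in the spirit of A-VERT, one minimal sketch is to embed the model's free-form output and each candidate target with a sentence-embedding model and select the candidate with the highest cosine similarity. The sentence-transformers model name and the scoring details below are assumptions for illustration, not the paper's exact pipeline.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed embedding backend

def match_candidate(generated: str, candidates: list[str], model: SentenceTransformer) -> tuple[str, float]:
    """Return the candidate whose embedding is closest (cosine) to the generated text."""
    vectors = np.asarray(model.encode([generated] + candidates))
    gen_vec, cand_vecs = vectors[0], vectors[1:]
    sims = cand_vecs @ gen_vec / (
        np.linalg.norm(cand_vecs, axis=1) * np.linalg.norm(gen_vec) + 1e-12
    )
    best = int(np.argmax(sims))
    return candidates[best], float(sims[best])

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
answer, score = match_candidate("The capital of France is Paris.", ["Paris", "Lyon", "Marseille"], model)
print(answer, round(score, 3))
```
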
Sources

Evidence for Limited Metacognition in LLMs

Detecting (Un)answerability in Large Language Models with Linear Directions

Bridging the Knowledge-Prediction Gap in LLMs on Multiple-Choice Questions

Reference-Free Rating of LLM Responses via Latent Information

Generalized Correctness Models: Learning Calibrated and Model-Agnostic Correctness Predictors from Historical Patterns

From Internal Representations to Text Quality: A Geometric Approach to LLM Evaluation

Pay-Per-Search Models are Abstention Models

A-VERT: Agnostic Verification with Embedding Ranking Targets

CLUE: Non-parametric Verification from Experience via Hidden-State Clustering
