The field of model evaluation and selection is moving towards more innovative and robust methods. Recent studies have focused on uncertainty-guided strategies for model selection, unsupervised model evaluation and ranking, and data-efficient evaluation of large language models (LLMs). These approaches aim to improve the accuracy and reliability of model assessment, particularly in scenarios where labeled data is scarce or unavailable.

Noteworthy papers in this area include:

Uncertainty-Guided Model Selection for Tabular Foundation Models in Biomolecule Efficacy Prediction, which demonstrates the effectiveness of using model uncertainty as a heuristic for selecting models in biomolecule efficacy prediction.

Confidence and Dispersity as Signals: Unsupervised Model Evaluation and Ranking, which presents a unified framework that uses confidence and dispersity as complementary signals for evaluating and ranking models without labels.

Toward a unified framework for data-efficient evaluation of large language models, which introduces a framework for data-efficient LLM evaluation based on Item Response Theory.

Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation, which presents a principled Bayesian evaluation framework that replaces Pass@k and average accuracy over N trials with posterior estimates of a model's underlying success probability, together with credible intervals.

Measuring Language Model Hallucinations Through Distributional Correctness, which introduces the Distributional Correctness Score, an evaluation metric designed to account for a model's entire probability distribution over answer choices rather than only a single chosen answer.

Instability in Downstream Task Performance During LLM Pretraining, which empirically analyzes the stability of downstream task performance for an LLM trained on diverse web-scale corpora and investigates two post-hoc checkpoint integration methods for mitigating this instability.

Making and Evaluating Calibrated Forecasts, which introduces a perfectly truthful calibration measure for multi-class prediction tasks and studies common methods of extending calibration measures from binary to multi-class prediction.

How NOT to benchmark your SITE metric: Beyond Static Leaderboards and Towards Realistic Evaluation, which empirically shows the shortcomings of widely used benchmark setups for evaluating transferability estimation metrics and provides concrete recommendations for constructing more robust and realistic benchmarks.

PTEB: Towards Robust Text Embedding Evaluation via Stochastic Paraphrasing at Evaluation Time with LLMs, which introduces a dynamic protocol that stochastically generates meaning-preserving paraphrases at evaluation time and aggregates results across multiple runs.

Minimal, hedged sketches of several of the techniques above follow.
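The exact uncertainty heuristic used in Uncertainty-Guided Model Selection for Tabular Foundation Models in Biomolecule Efficacy Prediction is not reproduced here. As a minimal sketch, assuming each candidate model exposes a hypothetical `predict_with_uncertainty` method returning per-sample predictive standard deviations on unlabeled target data, one could rank candidates by their mean uncertainty:

```python
import numpy as np

def rank_models_by_uncertainty(models, X_unlabeled):
    """Rank candidate models by mean predictive uncertainty (lower is preferred).

    Assumes each model has a hypothetical predict_with_uncertainty(X) method
    returning (predictions, per-sample standard deviations); the paper's actual
    heuristic may differ from this sketch.
    """
    scores = {}
    for name, model in models.items():
        _, std = model.predict_with_uncertainty(X_unlabeled)
        scores[name] = float(np.mean(std))  # mean predictive std as the selection score
    # Sort ascending: models that are less uncertain on the target data come first.
    return sorted(scores.items(), key=lambda kv: kv[1])
```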
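For Confidence and Dispersity as Signals, a rough sketch of the two unlabeled signals (the paper's exact scoring and combination rule may differ): confidence can be measured as the average maximum softmax probability, and dispersity as the entropy of the marginal predicted-class distribution, so that a well-adapted model is both confident and spreads its predictions across classes rather than collapsing to a few.

```python
import numpy as np

def confidence_dispersity_signals(probs):
    """Compute confidence and dispersity from softmax outputs on unlabeled data.

    probs: array of shape (n_samples, n_classes) with softmax probabilities.
    Returns (confidence, dispersity); how to combine them into one ranking
    score is a separate design choice.
    """
    # Confidence: average maximum softmax probability across samples.
    confidence = float(np.mean(np.max(probs, axis=1)))

    # Dispersity: entropy of the marginal predicted-class distribution,
    # normalized to [0, 1]; higher means predictions cover classes more evenly.
    pred_counts = np.bincount(np.argmax(probs, axis=1), minlength=probs.shape[1])
    marginal = pred_counts / pred_counts.sum()
    entropy = -np.sum(marginal * np.log(marginal + 1e-12))
    dispersity = float(entropy / np.log(probs.shape[1]))

    return confidence, dispersity
```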
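Item Response Theory models the probability that a model answers an evaluation item correctly as a function of a latent ability and item parameters, which is what enables evaluation from a small, informative item subset. A minimal sketch under the standard two-parameter logistic (2PL) model, assuming item discriminations and difficulties were already fit on existing evaluation data (the paper's framework may use a different parameterization):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def estimate_ability(responses, discrimination, difficulty,
                     grid=np.linspace(-4, 4, 801)):
    """Maximum-likelihood ability estimate under a 2PL IRT model.

    responses:      binary correctness vector over a small item subset.
    discrimination: item discrimination parameters a_j (assumed pre-fit).
    difficulty:     item difficulty parameters b_j (assumed pre-fit).

    P(correct | theta) = sigmoid(a_j * (theta - b_j)); we scan a grid of theta
    values and return the one maximizing the log-likelihood of the responses.
    """
    p = sigmoid(discrimination[None, :] * (grid[:, None] - difficulty[None, :]))
    loglik = (responses[None, :] * np.log(p + 1e-12)
              + (1 - responses[None, :]) * np.log(1 - p + 1e-12)).sum(axis=1)
    return float(grid[np.argmax(loglik)])
```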
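The Bayesian replacement for Pass@k can be illustrated with a simple Beta-Binomial sketch (the paper's exact prior and reporting conventions may differ): given k successes out of N independent trials, the posterior over the model's underlying success probability is Beta(alpha + k, beta + N - k), from which a point estimate and a credible interval follow directly.

```python
from scipy.stats import beta

def posterior_success_rate(successes, trials, alpha=1.0, beta_prior=1.0, cred=0.95):
    """Posterior over a model's underlying success probability.

    With a Beta(alpha, beta_prior) prior and `successes` out of `trials`
    independent attempts, the posterior is Beta(alpha + successes,
    beta_prior + trials - successes). Returns the posterior mean and a
    central credible interval.
    """
    a = alpha + successes
    b = beta_prior + trials - successes
    mean = a / (a + b)
    lo, hi = beta.ppf([(1 - cred) / 2, 1 - (1 - cred) / 2], a, b)
    return mean, (float(lo), float(hi))

# Example: 37 correct answers out of 50 attempts.
print(posterior_success_rate(37, 50))
```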
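The two post-hoc checkpoint integration methods studied in Instability in Downstream Task Performance During LLM Pretraining are not reproduced here. As a hedged sketch of one common approach of this kind, checkpoint weight averaging combines the parameters of several nearby pretraining checkpoints to smooth out fluctuations in downstream scores:

```python
import torch

def average_checkpoints(paths):
    """Average the parameters of several saved checkpoints.

    paths: list of file paths to state dicts saved with torch.save().
    Returns a single state dict whose tensors are the elementwise mean of the
    corresponding tensors across checkpoints. This is a generic smoothing
    sketch; the paper's specific integration methods may differ.
    """
    state_dicts = [torch.load(p, map_location="cpu") for p in paths]
    averaged = {}
    for key in state_dicts[0]:
        stacked = torch.stack([sd[key].float() for sd in state_dicts], dim=0)
        averaged[key] = stacked.mean(dim=0)
    return averaged
```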
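The truthful calibration measure proposed in Making and Evaluating Calibrated Forecasts is not reproduced here. For contrast, the sketch below shows a familiar binned calibration error for multi-class (top-label) prediction, the kind of standard measure against which truthful alternatives are usually compared:

```python
import numpy as np

def top_label_ece(probs, labels, n_bins=15):
    """Binned expected calibration error for multi-class (top-label) prediction.

    probs:  array (n_samples, n_classes) of predicted probabilities.
    labels: integer array (n_samples,) of true classes.
    Shown only as a familiar baseline calibration measure; the truthful
    measure proposed in the paper is different and not reproduced here.
    """
    confidences = np.max(probs, axis=1)
    predictions = np.argmax(probs, axis=1)
    correct = (predictions == labels).astype(float)

    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            # Gap between mean confidence and accuracy within the bin,
            # weighted by the fraction of samples falling in the bin.
            ece += mask.mean() * abs(confidences[mask].mean() - correct[mask].mean())
    return float(ece)
```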
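Transferability estimation metrics are typically benchmarked by how well their scores, computed before fine-tuning, correlate with accuracies obtained after fine-tuning a pool of models on a target task. The sketch below shows this static setup, which is the kind of benchmark the SITE paper argues needs to be made more realistic (the metric and accuracy values are assumed inputs):

```python
from scipy.stats import kendalltau, pearsonr

def benchmark_transferability_metric(metric_scores, finetuned_accuracies):
    """Correlate a transferability metric with ground-truth fine-tuned accuracy.

    metric_scores:        {model_name: score computed before fine-tuning}
    finetuned_accuracies: {model_name: accuracy after full fine-tuning}
    Returns rank and linear correlations over the shared models. This mirrors
    the common static-leaderboard setup rather than the paper's recommendations.
    """
    models = sorted(set(metric_scores) & set(finetuned_accuracies))
    scores = [metric_scores[m] for m in models]
    accs = [finetuned_accuracies[m] for m in models]
    tau, _ = kendalltau(scores, accs)
    r, _ = pearsonr(scores, accs)
    return {"kendall_tau": float(tau), "pearson_r": float(r)}
```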
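Finally, a hedged sketch of the evaluation-time paraphrasing idea behind PTEB (the function names and the paraphrase generator are assumptions, not the PTEB implementation): each run rewrites the evaluation texts with meaning-preserving paraphrases, scores the embedding model on the rewritten set, and the protocol reports the aggregate across runs instead of a single static score.

```python
import statistics

def paraphrase_robust_evaluation(texts, embed_and_score, paraphrase, n_runs=5, seed=0):
    """Aggregate an embedding benchmark score over stochastic paraphrase runs.

    embed_and_score(texts) -> float : existing evaluation of the embedding model
                                      on a list of texts (assumed to exist).
    paraphrase(text, seed) -> str   : hypothetical LLM-backed, meaning-preserving
                                      paraphraser standing in for the protocol's
                                      paraphrase-generation step.
    """
    run_scores = []
    for run in range(n_runs):
        rewritten = [paraphrase(t, seed=seed + run) for t in texts]
        run_scores.append(embed_and_score(rewritten))
    # Report the mean and spread across runs rather than one static number.
    return {
        "mean": statistics.mean(run_scores),
        "stdev": statistics.stdev(run_scores) if n_runs > 1 else 0.0,
        "runs": run_scores,
    }
```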