Evaluating Large Language Models

The field of natural language processing is moving toward more robust and reliable evaluation methods for large language models (LLMs). Current research focuses on assessing LLM quality efficiently across tasks such as dialogue evaluation, sentiment analysis, and question difficulty estimation. Noteworthy papers in this area include Learning an Efficient Multi-Turn Dialogue Evaluator from Multiple Judges, which proposes an efficient way to aggregate the judgments of multiple LLM judges when scoring dialogue quality, and GrandJury, which introduces a collaborative model evaluation protocol built on dynamic quality rubrics, enabling pluralistic and accountable evaluation of LLM outputs.
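
To make the multi-judge idea concrete, below is a minimal, generic sketch of combining per-turn scores from several LLM judges into a single dialogue-quality estimate. The judge names, the 1-5 rating scale, and the optional reliability weights are illustrative assumptions for this sketch, not details taken from the papers listed under Sources.

```python
from typing import Dict, List, Optional


def aggregate_judge_scores(
    scores_per_judge: Dict[str, List[float]],
    weights: Optional[Dict[str, float]] = None,
) -> List[float]:
    """Combine per-turn scores from several judges into one score per turn.

    scores_per_judge maps a judge name to its per-turn scores (e.g. on a 1-5 scale).
    weights optionally assigns each judge a reliability weight; if omitted,
    all judges are weighted equally (a simple mean per turn).
    """
    judges = list(scores_per_judge)
    n_turns = len(next(iter(scores_per_judge.values())))
    if weights is None:
        weights = {judge: 1.0 for judge in judges}
    total_weight = sum(weights[judge] for judge in judges)

    aggregated = []
    for turn in range(n_turns):
        weighted_sum = sum(
            weights[judge] * scores_per_judge[judge][turn] for judge in judges
        )
        aggregated.append(weighted_sum / total_weight)
    return aggregated


if __name__ == "__main__":
    # Hypothetical per-turn ratings from three LLM judges on a 1-5 scale.
    ratings = {
        "judge_a": [4.0, 3.0, 5.0],
        "judge_b": [4.5, 2.5, 4.5],
        "judge_c": [3.5, 3.0, 5.0],
    }
    # Equal weighting of all judges.
    print(aggregate_judge_scores(ratings))
    # Trusting judge_a twice as much as the others.
    print(aggregate_judge_scores(ratings, {"judge_a": 2.0, "judge_b": 1.0, "judge_c": 1.0}))
```

A learned evaluator would go further by distilling such aggregated signals into a single model, but the simple weighted mean above is enough to show how multiple judge outputs can be reconciled into one score.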

Sources

Learning an Efficient Multi-Turn Dialogue Evaluator from Multiple Judges

GrandJury: A Collaborative Machine Learning Model Evaluation Protocol for Dynamic Quality Rubrics

NLP Methods May Actually Be Better Than Professors at Estimating Question Difficulty

Beyond the Surface: Enhancing LLM-as-a-Judge Alignment with Human via Internal Representations

Dialogues Aspect-based Sentiment Quadruple Extraction via Structural Entropy Minimization Partitioning

LLMEval-3: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models

Let's Measure Information Step-by-Step: LLM-Based Evaluation Beyond Vibes
