Evaluating and Enhancing Large Language Models

The field of large language models (LLMs) is advancing rapidly, with increasing attention to evaluation and enhancement. Recent research has highlighted the limitations of traditional evaluation methodologies, which often rely on leaderboard rankings and provide little actionable feedback. In response, the community is shifting toward more comprehensive, fine-grained evaluation frameworks that can guide model optimization and profiling. There is also growing emphasis on fairness and bias detection in LLMs, with new approaches that prioritize metamorphic relations and surface implicit bias. In addition, researchers are exploring psychometrics and normative feedback as tools for evaluating and improving LLMs.

Noteworthy papers include:

From Rankings to Insights, which introduces Feedbacker, an evaluation framework that shifts the focus from leaderboard rankings to actionable feedback.

Efficient Fairness Testing in Large Language Models, which proposes an approach for prioritizing metamorphic relations for bias detection (a toy metamorphic-testing sketch follows this list).

Large Language Model Psychometrics, which provides a systematic review of the emerging field of LLM psychometrics.

LCES, which formulates zero-shot automated essay scoring as a pairwise comparison task (see the pairwise-scoring sketch below).

Beyond Likes, which proposes structured prosocial feedback as a complementary signal to likes and upvotes.

WorldView-Bench, which introduces a benchmark for evaluating global cultural perspectives in LLMs.

DIF, which provides a framework for benchmarking and verifying implicit bias in LLMs.

J1, which introduces a reinforcement learning approach to training LLM-as-a-Judge models that incentivizes explicit thinking before judgment.
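To make the metamorphic-relation idea concrete, here is a minimal, hypothetical Python sketch (not the method of the Efficient Fairness Testing paper, which is about prioritizing such relations): each relation rewrites a prompt by swapping a protected attribute, and a potential bias is flagged when the model's responses to the original and rewritten prompts diverge. The `query_llm` function and the relations themselves are illustrative placeholders.

```python
import re

# Hypothetical metamorphic relations: each maps a source prompt to a
# follow-up prompt whose answer should be equivalent for an unbiased model.
METAMORPHIC_RELATIONS = [
    ("gender_swap", {"he": "she", "his": "her", "man": "woman"}),
    ("age_swap", {"young": "elderly"}),
]

def apply_relation(prompt: str, substitutions: dict) -> str:
    """Rewrite the prompt by swapping the protected-attribute terms."""
    for src, dst in substitutions.items():
        prompt = re.sub(rf"\b{src}\b", dst, prompt, flags=re.IGNORECASE)
    return prompt

def query_llm(prompt: str) -> str:
    """Placeholder for a real model call (any chat/completions API)."""
    raise NotImplementedError

def responses_equivalent(a: str, b: str) -> bool:
    """Crude equivalence check; a real harness would use a semantic metric."""
    return a.strip().lower() == b.strip().lower()

def fairness_violations(prompts: list) -> list:
    """Return (relation_name, prompt) pairs where a relation is violated."""
    violations = []
    for prompt in prompts:
        base_answer = query_llm(prompt)
        for name, subs in METAMORPHIC_RELATIONS:
            follow_up = apply_relation(prompt, subs)
            if follow_up == prompt:  # relation not applicable to this prompt
                continue
            if not responses_equivalent(base_answer, query_llm(follow_up)):
                violations.append((name, prompt))
    return violations
```

A prioritization scheme such as the one the paper proposes would order these relations (and prompts) so that likely violations are discovered with fewer model queries; the sketch above simply enumerates them.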
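Similarly, a minimal sketch of the pairwise-comparison idea behind zero-shot essay scoring: an LLM judge is asked which of two essays is better, and the resulting win/loss counts are converted into scalar scores with a simple Bradley-Terry fit. The `judge_pair` call and the fitting details are assumptions for illustration, not the LCES algorithm itself.

```python
import itertools
import numpy as np

def judge_pair(essay_a: str, essay_b: str) -> int:
    """Placeholder LLM judge: return 0 if essay_a is better, 1 otherwise."""
    raise NotImplementedError

def bradley_terry_scores(essays: list, iters: int = 200) -> np.ndarray:
    """Fit Bradley-Terry strengths from all pairwise LLM judgments."""
    n = len(essays)
    wins = np.zeros((n, n))  # wins[i, j] = number of times i beat j
    for i, j in itertools.combinations(range(n), 2):
        winner = i if judge_pair(essays[i], essays[j]) == 0 else j
        loser = j if winner == i else i
        wins[winner, loser] += 1

    strengths = np.ones(n)
    for _ in range(iters):  # minorization-maximization updates
        for i in range(n):
            num = wins[i].sum()
            denom = sum(
                (wins[i, j] + wins[j, i]) / (strengths[i] + strengths[j] + 1e-9)
                for j in range(n) if j != i
            )
            strengths[i] = num / denom if denom > 0 else strengths[i]
        strengths /= strengths.sum()  # normalize to keep the fit stable
    return strengths  # higher strength = better-rated essay
```

With N essays this naive version issues O(N^2) judge calls; pairwise-comparison scoring methods typically sample a subset of pairs to keep the query budget manageable.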

Sources

From Rankings to Insights: Evaluation Should Shift Focus from Leaderboard to Feedback

Efficient Fairness Testing in Large Language Models: Prioritizing Metamorphic Relations for Bias Detection

Large Language Model Psychometrics: A Systematic Review of Evaluation, Validation, and Enhancement

LCES: Zero-shot Automated Essay Scoring via Pairwise Comparisons Using Large Language Models

Beyond Likes: How Normative Feedback Complements Engagement Signals on Social Media

WorldView-Bench: A Benchmark for Evaluating Global Cultural Perspectives in Large Language Models

DIF: A Framework for Benchmarking and Verifying Implicit Bias in LLMs

J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning
