Evaluating and Improving Large Language Models

The field of natural language processing is moving towards more robust and reliable evaluation methods for large language models (LLMs). Recent research has focused on building comprehensive benchmarks and frameworks to assess LLM performance on tasks such as automatic survey generation, question answering, and text summarization. These efforts aim to address the limitations of existing evaluation methods, including biased metrics and over-reliance on LLMs as judges. Notably, some papers introduce new approaches to evaluating LLMs, such as using human preference metrics and combining LLM-based scoring with quantitative metrics. Others explore synthetic data and dimension reduction techniques to improve evaluation frameworks. Overall, the field is shifting towards more nuanced and multifaceted evaluation methods to advance the development of LLMs. Noteworthy papers include SGSimEval, which introduces a comprehensive benchmark for survey generation, and LongRecall, which presents a structured approach for robust recall evaluation in long-form text. Additionally, the paper on evaluating knowledge graph complexity via semantic, spectral, and structural metrics offers useful insight into dataset complexity.
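
To make the idea of combining LLM-based scoring with quantitative metrics concrete, here is a minimal illustrative sketch (not taken from any of the cited papers): a simple token-level recall metric is blended with a placeholder LLM-judge rating into a single composite score. The function names, the stubbed judge value, and the equal weighting are all assumptions for illustration only.

```python
# Illustrative sketch: blending a quantitative metric with an LLM-judge rating.
# llm_judge_score is a placeholder; a real system would prompt a judge model.

def token_recall(reference: str, candidate: str) -> float:
    """Fraction of reference tokens that also appear in the candidate."""
    ref_tokens = reference.lower().split()
    cand_tokens = set(candidate.lower().split())
    if not ref_tokens:
        return 0.0
    return sum(t in cand_tokens for t in ref_tokens) / len(ref_tokens)

def llm_judge_score(reference: str, candidate: str) -> float:
    """Placeholder for an LLM-as-judge rating in [0, 1] (stubbed for illustration)."""
    return 0.8

def composite_score(reference: str, candidate: str, weight: float = 0.5) -> float:
    """Weighted blend of the quantitative metric and the LLM-judge rating."""
    return (weight * token_recall(reference, candidate)
            + (1 - weight) * llm_judge_score(reference, candidate))

if __name__ == "__main__":
    ref = "The survey covers evaluation benchmarks for large language models."
    cand = "This survey reviews benchmarks for evaluating large language models."
    print(f"composite score: {composite_score(ref, cand):.3f}")
```

A weighted blend like this is only one possible design; the papers listed below study richer variants, such as similarity-enhanced benchmarks and structured recall checks over long-form outputs.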

Sources

SGSimEval: A Comprehensive Multifaceted and Similarity-Enhanced Benchmark for Automatic Survey Generation Systems

Can we Evaluate RAGs with Synthetic Data?

Consensus or Conflict? Fine-Grained Evaluation of Conflicting Answers in Question-Answering

DIT: Dimension Reduction View on Optimal NFT Rarity Meters

TracSum: A New Benchmark for Aspect-Based Summarization with Sentence-Level Traceability in Medical Domain

The illusion of a perfect metric: Why evaluating AI's words is harder than it looks

LongRecall: A Structured Approach for Robust Recall Evaluation in Long-Form Text

Identifying and Answering Questions with False Assumptions: An Interpretable Approach

Are Checklists Really Useful for Automatic Evaluation of Generative Tasks?

Evaluating Knowledge Graph Complexity via Semantic, Spectral, and Structural Metrics for Link Prediction

KG-EDAS: A Meta-Metric Framework for Evaluating Knowledge Graph Completion Models

Attribution, Citation, and Quotation: A Survey of Evidence-based Text Generation with Large Language Models
