Evaluating and Improving Large Language Models

The field of large language models (LLMs) is moving to address concerns around temporal prediction, factuality evaluation, and community detection in temporal networks. Researchers are investigating whether prompting-based unlearning can simulate earlier knowledge cutoffs in LLMs, and are developing new benchmarks and evaluation metrics to assess the reliability of LLMs across these tasks. Two themes stand out: communities evolve over time, so detection benchmarks must capture that evolution, and benchmarks themselves age, which skews LLM factuality evaluation.
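
To make the prompting-based approach concrete, the sketch below wraps a question in an instruction asking the model to ignore knowledge acquired after a given date. The template wording and the `call_llm` helper are illustrative assumptions, not the prompts used in "Can Prompts Rewind Time for LLMs?":

```python
# Minimal sketch of prompt-based knowledge-cutoff simulation ("prompted
# unlearning"). `call_llm` is a hypothetical stand-in for any text-generation
# API; the template wording is illustrative, not taken from the paper.

CUTOFF_TEMPLATE = (
    "You are an assistant whose knowledge ends on {cutoff_date}. "
    "Answer as if you know nothing about events, publications, or data "
    "after that date. If the answer depends on later events, say you "
    "do not know.\n\nQuestion: {question}"
)

def ask_with_simulated_cutoff(call_llm, question: str, cutoff_date: str) -> str:
    """Ask `question` while instructing the model to suppress any
    knowledge acquired after `cutoff_date`."""
    prompt = CUTOFF_TEMPLATE.format(cutoff_date=cutoff_date, question=question)
    return call_llm(prompt)

# One way to probe effectiveness: ask the same time-sensitive question at two
# simulated cutoffs and check whether post-cutoff knowledge actually disappears.
# ask_with_simulated_cutoff(call_llm, "Who holds the 100m world record?", "2008-01-01")
```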

Noteworthy papers in this area include DynBenchmark, which proposes a community-centered model for generating customizable, evolving community structures as ground truth; All Claims Are Equal, but Some Claims Are More Equal Than Others, which introduces metrics that weight claims by relevance and importance, giving greater sensitivity when measuring the factuality of responses; and When Benchmarks Age, which presents a systematic investigation of benchmark aging and its impact on LLM factuality evaluation.
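
To illustrate the importance-sensitive idea, the sketch below weights each extracted claim by an importance score instead of counting all claims equally, so an unsupported but trivial claim barely lowers the score. The `Claim` structure and the weighting scheme are assumptions for illustration, not the exact metrics defined in the paper:

```python
# Minimal sketch of importance-weighted factuality scoring. The claim
# structure and weights are illustrative assumptions; the paper defines
# its own importance-sensitive metrics.

from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    supported: bool    # verified against a knowledge source
    importance: float  # relevance/importance weight, e.g. from a judge model

def weighted_factuality(claims: list[Claim]) -> float:
    """Importance-weighted precision: supported importance mass / total mass."""
    total = sum(c.importance for c in claims)
    if total == 0:
        return 0.0
    return sum(c.importance for c in claims if c.supported) / total

claims = [
    Claim("The drug reduced mortality by 12%.", supported=True, importance=0.9),
    Claim("The trial's logo was blue.", supported=False, importance=0.1),
]
print(weighted_factuality(claims))  # 0.9: the unsupported trivial claim barely hurts
```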

Sources

Can Prompts Rewind Time for LLMs? Evaluating the Effectiveness of Prompted Knowledge Cutoffs

DynBenchmark: Customizable Ground Truths to Benchmark Community Detection and Tracking in Temporal Networks

Visualization of Interpersonal Communication using Indoor Positioning Technology with UWB Tags

All Claims Are Equal, but Some Claims Are More Equal Than Others: Importance-Sensitive Factuality Evaluation of LLM Generations

When Benchmarks Age: Temporal Misalignment through Large Language Model Factuality Evaluation
