Evaluating and Improving Large Language Models

The field of large language models (LLMs) is moving to address concerns around temporal prediction, factuality evaluation, and community detection in temporal networks. Researchers are investigating whether prompting-based unlearning can simulate earlier knowledge cutoffs in LLMs, and are developing new benchmarks and evaluation metrics to assess the reliability of LLMs across these tasks. Two themes stand out: communities evolve over time, so detection benchmarks must capture that evolution, and benchmarks themselves age, which skews LLM factuality evaluation.
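
To make the prompting-based approach concrete, the sketch below wraps a question in an instruction asking the model to ignore knowledge acquired after a given date. The template wording and the `call_llm` helper are illustrative assumptions, not the prompts used in "Can Prompts Rewind Time for LLMs?":

```python
# Minimal sketch of prompt-based knowledge-cutoff simulation ("prompted
# unlearning"). `call_llm` is a hypothetical stand-in for any text-generation
# API; the template wording is illustrative, not taken from the paper.

CUTOFF_TEMPLATE = (
    "You are an assistant whose knowledge ends on {cutoff_date}. "
    "Answer as if you know nothing about events, publications, or data "
    "after that date. If the answer depends on later events, say you "
    "do not know.\n\nQuestion: {question}"
)

def ask_with_simulated_cutoff(call_llm, question: str, cutoff_date: str) -> str:
    """Ask `question` while instructing the model to suppress any
    knowledge acquired after `cutoff_date`."""
    prompt = CUTOFF_TEMPLATE.format(cutoff_date=cutoff_date, question=question)
    return call_llm(prompt)

# One way to probe effectiveness: ask the same time-sensitive question at two
# simulated cutoffs and check whether post-cutoff knowledge actually disappears.
# ask_with_simulated_cutoff(call_llm, "Who holds the 100m world record?", "2008-01-01")
```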

Noteworthy papers in this area include DynBenchmark, which proposes a community-centered model for generating customizable, evolving community structures as ground truth; All Claims Are Equal, but Some Claims Are More Equal Than Others, which introduces metrics that weight claims by relevance and importance, giving greater sensitivity when measuring the factuality of responses; and When Benchmarks Age, which presents a systematic investigation of benchmark aging and its impact on LLM factuality evaluation.
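
To illustrate the importance-sensitive idea, the sketch below weights each extracted claim by an importance score instead of counting all claims equally, so an unsupported but trivial claim barely lowers the score. The `Claim` structure and the weighting scheme are assumptions for illustration, not the exact metrics defined in the paper:

```python
# Minimal sketch of importance-weighted factuality scoring. The claim
# structure and weights are illustrative assumptions; the paper defines
# its own importance-sensitive metrics.

from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    supported: bool    # verified against a knowledge source
    importance: float  # relevance/importance weight, e.g. from a judge model

def weighted_factuality(claims: list[Claim]) -> float:
    """Importance-weighted precision: supported importance mass / total mass."""
    total = sum(c.importance for c in claims)
    if total == 0:
        return 0.0
    return sum(c.importance for c in claims if c.supported) / total

claims = [
    Claim("The drug reduced mortality by 12%.", supported=True, importance=0.9),
    Claim("The trial's logo was blue.", supported=False, importance=0.1),
]
print(weighted_factuality(claims))  # 0.9: the unsupported trivial claim barely hurts
```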

Sources

Can Prompts Rewind Time for LLMs? Evaluating the Effectiveness of Prompted Knowledge Cutoffs

DynBenchmark: Customizable Ground Truths to Benchmark Community Detection and Tracking in Temporal Networks

Visualization of Interpersonal Communication using Indoor Positioning Technology with UWB Tags

All Claims Are Equal, but Some Claims Are More Equal Than Others: Importance-Sensitive Factuality Evaluation of LLM Generations

When Benchmarks Age: Temporal Misalignment through Large Language Model Factuality Evaluation
