Evaluating Long-form Question Answering and Language Models

Research in natural language processing is moving toward more rigorous evaluation of long-form question answering and of language model outputs more broadly. Recent work probes new metrics and protocols for judging generated answers, including nugget-based evaluation methodologies and careful human studies. New benchmarks such as HCT-QA, which targets question answering over human-centric tables, provide further opportunities for testing and improving language models, while applications of large language models to real-world problems, such as accident data collection and analysis, are becoming increasingly prominent. Notable papers in this area include: An Empirical Study of Evaluating Long-form Question Answering, which examines the limitations of existing evaluation metrics and proposes improvements, and Chatbot Arena Meets Nuggets, which applies the AutoNuggetizer framework to data from Search Arena battles and reports a significant correlation between nugget scores and human preferences.
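To make the nugget idea concrete: a response is scored by the weighted fraction of atomic facts ("nuggets") it covers. The sketch below illustrates only that scoring arithmetic with hypothetical data structures and weights; it is not the AutoNuggetizer pipeline itself, which relies on an LLM to extract nuggets and judge whether a response supports them.

```python
# Minimal sketch of nugget-style scoring. The Nugget class, the vital/okay
# weights, and the example data are assumptions for illustration only.
from dataclasses import dataclass


@dataclass
class Nugget:
    text: str    # an atomic fact a good answer should contain
    vital: bool  # True if the fact is essential ("vital"), False if merely "okay"


def nugget_score(nuggets: list[Nugget], supported: set[str],
                 vital_weight: float = 1.0, okay_weight: float = 0.5) -> float:
    """Weighted recall of nuggets judged present in a system response.

    `supported` holds the texts of nuggets found in the response; in practice
    this judgment would come from an LLM or human assessor, not string matching.
    """
    weight = lambda n: vital_weight if n.vital else okay_weight
    total = sum(weight(n) for n in nuggets)
    if total == 0:
        return 0.0
    hit = sum(weight(n) for n in nuggets if n.text in supported)
    return hit / total


# Example: the response covers one vital and one okay nugget out of three.
nuggets = [
    Nugget("The capital of France is Paris", vital=True),
    Nugget("Paris lies on the Seine", vital=False),
    Nugget("Its population is about 2.1 million", vital=False),
]
print(nugget_score(nuggets, {"The capital of France is Paris",
                             "Paris lies on the Seine"}))  # 0.75
```

Aggregating such scores per model and comparing them against human preference rankings is one way a correlation like the one reported in Chatbot Arena Meets Nuggets could be measured.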

Sources

An Empirical Study of Evaluating Long-form Question Answering

Chatbot Arena Meets Nuggets: Towards Explanations and Diagnostics in the Evaluation of LLM Responses

HCT-QA: A Benchmark for Question Answering on Human-Centric Tables

Are Information Retrieval Approaches Good at Harmonising Longitudinal Survey Questions in Social Science?

From Precision to Perception: User-Centred Evaluation of Keyword Extraction Algorithms for Internet-Scale Contextual Advertising

Design and Application of Multimodal Large Language Model Based System for End to End Automation of Accident Dataset Generation

ConSens: Assessing context grounding in open-book question answering
