Evaluating Large Language Models in Healthcare

The field of natural language processing is moving toward more nuanced and accurate evaluation of large language models (LLMs) in high-stakes applications such as healthcare. Traditional accuracy metrics are being supplemented with richer frameworks that capture topic-specific performance, question characteristics such as difficulty and discrimination, and latent model ability. This shift is driven by the need for reliable and trustworthy deployment of LLMs in healthcare settings. Noteworthy papers in this area include:

  • A study introducing a rigorous evaluation framework grounded in Item Response Theory, which estimates model ability jointly with question difficulty and discrimination (a minimal sketch of this kind of model follows the list).
  • A community-driven evaluation pipeline that enables scalable and automated benchmarking of LLMs in healthcare chatbot settings, highlighting the importance of culturally aware and inclusive evaluation methodologies.
  • A model that applies cognitive diagnosis to teacher-student dialogues, providing a powerful tool for assessing students' cognitive states.
  • A comprehensive evaluation of doctor agents' inquiry capability, revealing substantial challenges in medical multi-turn questioning and highlighting the need for more nuanced evaluation frameworks.
  • A metric that measures knowledge-aware refusal in factual tasks, providing insight into an important but previously overlooked aspect of LLM factuality (see the illustrative sketch after the list).
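
To make the Item Response Theory framing concrete, below is a minimal sketch of a two-parameter logistic (2PL) IRT model fit with plain gradient ascent. It is a generic illustration of estimating model ability jointly with question difficulty and discrimination, not the cited paper's implementation; the function name, learning rate, and toy response matrix are assumptions for illustration only.

```python
# Minimal 2PL IRT sketch (illustrative, not the paper's method): given a binary
# response matrix R (models x questions), jointly estimate model ability
# theta_i, question difficulty b_j, and discrimination a_j under
# P(correct_ij) = sigmoid(a_j * (theta_i - b_j)), by gradient ascent on the
# Bernoulli log-likelihood.
import numpy as np

def fit_2pl(R, lr=0.05, epochs=2000, seed=0):
    rng = np.random.default_rng(seed)
    n_models, n_items = R.shape
    theta = rng.normal(0, 0.1, n_models)   # model abilities
    b = rng.normal(0, 0.1, n_items)        # question difficulties
    a = np.ones(n_items)                   # question discriminations

    for _ in range(epochs):
        z = a * (theta[:, None] - b)       # logits, shape (models, items)
        p = 1.0 / (1.0 + np.exp(-z))       # predicted correctness probabilities
        err = R - p                        # d(log-likelihood)/d(logit)
        theta += lr * (err * a).sum(axis=1) / n_items
        b     -= lr * (err * a).sum(axis=0) / n_models
        a     += lr * (err * (theta[:, None] - b)).sum(axis=0) / n_models
        theta -= theta.mean()              # pin the latent scale's location
    return theta, b, a

# Toy usage: 5 "models" answering 8 "questions" (1 = correct, 0 = incorrect).
R = np.array([
    [1, 1, 1, 1, 1, 1, 0, 1],
    [1, 1, 1, 0, 1, 0, 0, 1],
    [1, 0, 1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0],
], dtype=float)
theta, difficulty, discrimination = fit_2pl(R)
print("abilities:", np.round(theta, 2))
print("difficulties:", np.round(difficulty, 2))
```

Unlike plain accuracy, the fitted ability scores account for which questions a model got right, so correct answers on hard, highly discriminating questions count for more than correct answers on easy ones.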
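
For the knowledge-aware refusal item, the following is one hypothetical way such a measurement could be operationalized; it is not the metric defined in the cited paper. The `Record` fields, the correctness-based knowledge proxy, and the toy data are all assumptions made for illustration.

```python
# Hypothetical knowledge-aware refusal measurement (not the cited paper's
# metric): partition factual questions by whether the model "knows" the answer
# (proxied here by correctness when forced to answer), then compare refusal
# rates on unknown vs. known questions. Ideally a model refuses what it does
# not know and answers what it does.
from dataclasses import dataclass

@dataclass
class Record:
    knows_answer: bool   # proxy: answered correctly when forced to answer
    refused: bool        # refused when refusal was allowed

def knowledge_aware_refusal(records: list[Record]) -> dict[str, float]:
    known = [r for r in records if r.knows_answer]
    unknown = [r for r in records if not r.knows_answer]
    refusal_on_unknown = sum(r.refused for r in unknown) / max(len(unknown), 1)
    refusal_on_known = sum(r.refused for r in known) / max(len(known), 1)
    return {
        "refusal_on_unknown": refusal_on_unknown,   # higher is better
        "over_refusal_on_known": refusal_on_known,  # lower is better
        "gap": refusal_on_unknown - refusal_on_known,
    }

# Toy usage with fabricated records, purely for illustration.
records = ([Record(True, False)] * 8 + [Record(True, True)] * 2
           + [Record(False, True)] * 6 + [Record(False, False)] * 4)
print(knowledge_aware_refusal(records))
```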

Sources

Beyond Overall Accuracy: A Psychometric Deep Dive into the Topic-Specific Medical Capabilities of 80 Large Language Models

Building Benchmarks from the Ground Up: Community-Centered Evaluation of LLMs in Healthcare Chatbot Settings

DiaCDM: Cognitive Diagnosis in Teacher-Student Dialogues using the Initiation-Response-Evaluation Framework

The Dialogue That Heals: A Comprehensive Evaluation of Doctor Agents' Inquiry Capability

Can LLMs Refuse Questions They Do Not Know? Measuring Knowledge-Aware Refusal in Factual Tasks
