Advancements in Medical Applications of Large Language Models

The field of medical applications of large language models (LLMs) is advancing rapidly, with a focus on improving the accuracy, reliability, and explainability of these models in clinical settings. Recent work has centered on more comprehensive evaluation frameworks, such as MedHELM, which assess LLM performance across a range of medical tasks, and on hybrid approaches like MedOrchestra, which combine cloud and local LLMs to preserve data privacy while maintaining performance (a minimal sketch of this split appears below). Innovations in multimodal large language models (MLLMs), such as Infi-Med, are improving reasoning capability and resource efficiency, making these models more suitable for real-world healthcare applications. The integration of LLMs into clinical decision-making is also being explored, with systems like MedRAG and AutoCT showing potential for supporting diagnosis, treatment, and clinical trial prediction.

Noteworthy papers include MedHELM, which introduced a systematic comparison of LLMs with improved evaluation methods, and MedOrchestra, which proposed a hybrid cloud/local framework for clinical data interpretation. Infi-Med achieved state-of-the-art performance in general medical reasoning while remaining rapidly adaptable to clinical scenarios, and SearchAI, a generative AI approach, improved the accuracy and efficiency of searching clinical data, outperforming traditional methods. Improving Reliability and Explainability of Medical Question Answering through Atomic Fact Checking in Retrieval-Augmented LLMs demonstrated significant gains in factual accuracy and explainability (the core idea is sketched below).

Several new benchmarks and datasets were also introduced. MedPAIR evaluates how physician trainees and LLMs prioritize relevant information when answering medical questions, finding that LLMs are often misaligned with the trainees' relevance judgments. LLMEval-Med presents a physician-validated, real-world clinical benchmark for medical LLMs, covering five core medical areas with 2,996 questions created from real-world electronic health records and expert-designed clinical scenarios. ArchEHR-QA introduces a dataset for addressing patients' information needs about the clinical course of a hospitalization, evaluating three open-weight LLMs across three prompting strategies.

On the methods side, AutoCT proposed a framework that combines the reasoning capabilities of LLMs with the explainability of classical machine learning for clinical trial prediction (a sketch of this pattern appears below). MedAgentGym introduced a training environment designed to strengthen coding-based medical reasoning in LLM agents, and Truth in the Few proposed a data selection paradigm, Reasoning Activation Potential (RAP), which identifies cognitive samples by estimating each sample's potential to stimulate genuine multi-modal reasoning.
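The cloud/local split that MedOrchestra describes can be illustrated with a minimal Python sketch. The function names, prompts, and exact division of labor below are assumptions for illustration, not the paper's actual protocol: a cloud LLM sees only a de-identified task description and returns instructions, which a local LLM then applies to the protected clinical text, so raw patient data never leaves the local environment.

```python
from typing import Callable

# Hypothetical sketch of a cloud/local hybrid pipeline in the spirit of
# MedOrchestra. `cloud_llm` and `local_llm` stand in for any text-completion
# callables; the task split shown here is an assumption, not the paper's
# exact orchestration protocol.
def hybrid_clinical_pipeline(
    task_description: str,           # de-identified description of the task
    protected_note: str,             # raw clinical text, kept local
    cloud_llm: Callable[[str], str],
    local_llm: Callable[[str], str],
) -> str:
    # Step 1: the cloud model designs instructions from the task description
    # alone; it never receives the protected note.
    instructions = cloud_llm(
        "Write step-by-step instructions for a smaller local model to solve "
        f"this clinical task without seeing any patient identifiers:\n{task_description}"
    )
    # Step 2: the local model executes those instructions on the sensitive
    # text inside the local environment.
    return local_llm(f"{instructions}\n\nClinical note:\n{protected_note}")
```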
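Atomic fact checking in retrieval-augmented question answering generally means decomposing a generated answer into individual claims and verifying each one against the retrieved evidence. The sketch below assumes a single LLM callable handles both decomposition and verification; the prompts and the SUPPORTED/UNSUPPORTED labelling are illustrative choices, not the paper's exact procedure.

```python
from typing import Callable, List

def atomic_fact_check(
    answer: str,
    evidence: List[str],
    llm: Callable[[str], str],
) -> List[dict]:
    """Split an answer into atomic facts and check each against evidence.

    Minimal sketch of the general technique; the cited paper's decomposition,
    verification prompts, and scoring are not reproduced here.
    """
    # Step 1: decompose the answer into self-contained atomic claims.
    facts = [
        line.strip("- ").strip()
        for line in llm(
            "List each factual claim in the following answer, one per line:\n"
            + answer
        ).splitlines()
        if line.strip()
    ]

    # Step 2: verify each claim against the retrieved evidence passages.
    context = "\n\n".join(evidence)
    results = []
    for fact in facts:
        verdict = llm(
            "Reply SUPPORTED or UNSUPPORTED. Is the claim entailed by the "
            f"evidence?\n\nEvidence:\n{context}\n\nClaim: {fact}"
        )
        results.append(
            {"fact": fact, "supported": "UNSUPPORTED" not in verdict.upper()}
        )
    return results
```

Unsupported facts can then be removed, revised against the evidence, or surfaced to the reader, which is where the gains in factual accuracy and explainability come from.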
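One common way to pair LLM reasoning with an explainable classical model, in the spirit of what AutoCT describes for clinical trial prediction, is a two-stage pipeline: the LLM converts free-text trial descriptions into a fixed set of structured features, and a transparent classifier such as logistic regression makes the final prediction. The feature names and prompt below are illustrative assumptions, not AutoCT's actual design.

```python
import json
from typing import Callable, List, Sequence

from sklearn.linear_model import LogisticRegression

def extract_features(description: str, llm: Callable[[str], str]) -> List[float]:
    # Ask the LLM for a small, fixed set of numeric features (names assumed).
    raw = llm(
        "Return JSON with keys enrollment_size, num_sites, is_randomized (0/1), "
        f"and phase (1-4) for this clinical trial description:\n{description}"
    )
    parsed = json.loads(raw)
    return [
        float(parsed["enrollment_size"]),
        float(parsed["num_sites"]),
        float(parsed["is_randomized"]),
        float(parsed["phase"]),
    ]

def train_trial_predictor(
    descriptions: Sequence[str],
    outcomes: Sequence[int],
    llm: Callable[[str], str],
) -> LogisticRegression:
    # Fit an interpretable classifier on the LLM-derived features; its
    # coefficients give a direct, inspectable view of feature influence.
    X = [extract_features(d, llm) for d in descriptions]
    return LogisticRegression().fit(X, outcomes)
```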
Sources
Improving Reliability and Explainability of Medical Question Answering through Atomic Fact Checking in Retrieval-Augmented LLMs
Performance of leading large language models in May 2025 in Membership of the Royal College of General Practitioners-style examination questions: a cross-sectional analysis