Advancements in Evaluating and Improving Large Language Models for Biomedical and Ethical Applications

The field of natural language processing is seeing rapid progress in evaluating and improving large language models (LLMs) for biomedical and ethical applications. Recent work centers on benchmarks and frameworks that probe what these models can and cannot do in such domains, with a clear shift toward testing whether LLMs can reason, synthesize knowledge, and ground their answers in evidence rather than merely recall memorized facts. In parallel, there is growing attention to the ethical implications of deploying LLMs, including bias, fairness, and transparency; the SAFE-AI framework for medical AI ethics policies and fuzzy approaches to the specification, verification, and validation of risk-based ethical decision-making models are representative of this line of work. Taken together, these efforts point toward a more nuanced understanding of LLM capabilities and limitations, and toward more robust, trustworthy models for biomedical and ethical use. Noteworthy papers include BioPars, a pretrained biomedical large language model for Persian biomedical text mining, and HealthQA-BR, a system-wide benchmark for Portuguese-speaking healthcare.
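
As a concrete illustration of the benchmark-driven evaluation these papers share, the sketch below scores a model on a multiple-choice medical QA set by exact match. The data schema, the `ask_model` callable, and the scoring rule are assumptions for illustration only, not the protocol of any specific benchmark listed here.

```python
# Minimal sketch of multiple-choice QA benchmark scoring (hypothetical data
# schema and model interface; not the protocol of any specific paper above).
from typing import Callable, Dict, List


def evaluate_mcq(
    items: List[Dict],                # each item: {"question", "options", "answer"}
    ask_model: Callable[[str], str],  # returns the model's chosen option letter
) -> float:
    """Return exact-match accuracy over a multiple-choice QA set."""
    correct = 0
    for item in items:
        prompt = item["question"] + "\n" + "\n".join(
            f"{letter}. {text}" for letter, text in item["options"].items()
        )
        prediction = ask_model(prompt).strip().upper()[:1]  # keep the option letter
        correct += prediction == item["answer"]
    return correct / len(items) if items else 0.0


if __name__ == "__main__":
    # Usage with a stubbed model that always answers "A".
    sample = [{
        "question": "Which vitamin deficiency causes scurvy?",
        "options": {"A": "Vitamin C", "B": "Vitamin D", "C": "Vitamin K"},
        "answer": "A",
    }]
    print(evaluate_mcq(sample, lambda prompt: "A"))  # -> 1.0
```

Exact-match accuracy is only a starting point; several of the papers below argue for going beyond it, for example by checking whether a model's confidence tracks its correctness.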

Sources

BioPars: A Pretrained Biomedical Large Language Model for Persian Biomedical Text Mining

HealthQA-BR: A System-Wide Benchmark Reveals Critical Knowledge Gaps in Large Language Models

Overview of the ClinIQLink 2025 Shared Task on Medical Question-Answering

PapersPlease: A Benchmark for Evaluating Motivational Values of Large Language Models Based on ERG Theory

MedEthicsQA: A Comprehensive Question Answering Benchmark for Medical Ethics Evaluation of LLMs

The Confidence Paradox: Can LLM Know When It's Wrong

Computational Detection of Intertextual Parallels in Biblical Hebrew: A Benchmark Study Using Transformer-Based Language Models

Pitfalls of Evaluating Language Models with Open Benchmarks

A Practical SAFE-AI Framework for Small and Medium-Sized Enterprises Developing Medical Artificial Intelligence Ethics Policies

A Fuzzy Approach to the Specification, Verification and Validation of Risk-Based Ethical Decision Making Models

Confidence and Stability of Global and Pairwise Scores in NLP Evaluation
