Advancements in Human-Like Reasoning and Fairness in Large Language Models

The field of large language models (LLMs) is moving toward more human-like reasoning and greater fairness. Recent research has focused on developing benchmarks that evaluate the trustworthiness and moral reasoning of LLMs, assessing their ability to simulate human behavior, make moral decisions, and avoid bias. The results show that while LLMs have made significant progress, they still struggle with tasks that require deep understanding and nuance. Noteworthy papers in this area include HugAgent, which introduces a benchmark for evaluating LLMs' ability to simulate human-like individual reasoning, and MoReBench, which evaluates LLMs' procedural and pluralistic moral reasoning rather than outcomes alone. Another notable paper, FinTrust, presents a comprehensive benchmark for evaluating the trustworthiness of LLMs in finance applications.
Sources
"She's Like a Person but Better": Characterizing Companion-Assistant Dynamics in Human-AI Relationships
MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More than Outcomes
A Justice Lens on Fairness and Ethics Courses in Computing Education: LLM-Assisted Multi-Perspective and Thematic Evaluation