Advancements in Human-Like Reasoning and Fairness in Large Language Models

The field of large language models (LLMs) is moving toward more human-like reasoning and greater fairness. Recent research has focused on developing benchmarks that evaluate the trustworthiness and moral reasoning of LLMs, assessing their ability to simulate human behavior, make moral decisions, and avoid bias. The results show that while LLMs have made significant progress, they still struggle with tasks that require deep understanding and nuance. Noteworthy papers in this area include HugAgent, which introduces a benchmark for evaluating LLMs' ability to simulate human-like individual reasoning, and MoReBench, which evaluates LLMs' procedural and pluralistic moral reasoning rather than outcomes alone. Another notable paper, FinTrust, presents a comprehensive benchmark for evaluating the trustworthiness of LLMs in finance applications.
Sources
"She's Like a Person but Better": Characterizing Companion-Assistant Dynamics in Human-AI Relationships
MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More than Outcomes
A Justice Lens on Fairness and Ethics Courses in Computing Education: LLM-Assisted Multi-Perspective and Thematic Evaluation