Evaluating Social Reasoning in Large Language Models

Research on large language models (LLMs) is increasingly focused on evaluating their social reasoning: the ability to understand social contexts, infer others' mental states, and make decisions in complex interactive scenarios. Recent studies introduce benchmarks and frameworks for assessing these skills and underscore the need for more human-aligned, value-sensitive models. Notable papers include DEL-ToM, which improves Theory-of-Mind reasoning through inference-time scaling grounded in dynamic epistemic logic, and ToMAP, which trains opponent-aware LLM persuaders. Benchmarks such as SocialMaze and CK-Arena probe these abilities more broadly, revealing substantial differences in how well models handle dynamic interactions and integrate temporally evolving information. Overall, the field is moving toward richer, more nuanced evaluations of social reasoning, aimed at producing more effective and human-like models.
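To make the inference-time scaling idea concrete, the sketch below shows a generic best-of-N scheme: sample several candidate reasoning traces and keep the one a verifier scores highest. This is only an illustration of the general pattern under assumed placeholder components; the sampler and belief-consistency scorer here are hypothetical stubs, not DEL-ToM's actual dynamic-epistemic-logic machinery.

```python
# Minimal best-of-N sketch of inference-time scaling for Theory-of-Mind
# reasoning. sample_reasoning_traces and belief_consistency_score are
# hypothetical placeholders, not components from the DEL-ToM paper.
import random
from typing import Callable

def sample_reasoning_traces(prompt: str, n: int) -> list[str]:
    # Stand-in for n stochastic LLM decodes of a belief-reasoning trace.
    return [f"trace-{i} for: {prompt}" for i in range(n)]

def belief_consistency_score(trace: str) -> float:
    # Stand-in for a verifier that checks each belief-update step
    # (DEL-ToM uses dynamic epistemic logic; this stub is random).
    return random.random()

def best_of_n(prompt: str, n: int = 8,
              scorer: Callable[[str], float] = belief_consistency_score) -> str:
    """Spend extra inference-time compute by sampling n candidate
    reasoning traces and returning the highest-scoring one."""
    traces = sample_reasoning_traces(prompt, n)
    return max(traces, key=scorer)

if __name__ == "__main__":
    print(best_of_n("Where does Sally think the marble is?", n=4))
```

The key point is that quality improves with the sampling budget n and the fidelity of the scorer, without retraining the underlying model.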

Sources

Social preferences with unstable interactive reasoning: Large language models in economic trust games

DEL-ToM: Inference-Time Scaling for Theory-of-Mind Reasoning via Dynamic Epistemic Logic

Probe by Gaming: A Game-based Benchmark for Assessing Conceptual Knowledge in LLMs

Towards Dynamic Theory of Mind: Evaluating LLM Adaptation to Temporal Evolution of Human States

Survival Games: Human-LLM Strategic Showdowns under Severe Resource Scarcity

The Staircase of Ethics: Probing LLM Value Priorities through Multi-Step Induction to Complex Moral Dilemmas

GuessArena: Guess Who I Am? A Self-Adaptive Framework for Evaluating LLMs in Domain-Specific Knowledge and Reasoning

ToMAP: Training Opponent-Aware LLM Persuaders with Theory of Mind

LLM Agents for Bargaining with Utility-based Feedback

SocialMaze: A Benchmark for Evaluating Social Reasoning in Large Language Models
