Evaluating Social Reasoning in Large Language Models

Research on large language models (LLMs) is increasingly focused on evaluating their social reasoning: the ability to understand social contexts, infer others' mental states, and make decisions in complex interactive scenarios. Recent studies introduce benchmarks and frameworks for assessing these skills and underscore the need for more human-aligned, value-sensitive models. Notable papers include DEL-ToM, which improves Theory-of-Mind reasoning through inference-time scaling grounded in dynamic epistemic logic, and ToMAP, which trains opponent-aware LLM persuaders. Benchmarks such as SocialMaze and CK-Arena probe these abilities more broadly, revealing substantial differences in how well models handle dynamic interactions and integrate temporally evolving information. Overall, the field is moving toward richer, more nuanced evaluations of social reasoning, aimed at producing more effective and human-like models.
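To make the inference-time scaling idea concrete, the sketch below shows a generic best-of-N scheme: sample several candidate reasoning traces and keep the one a verifier scores highest. This is only an illustration of the general pattern under assumed placeholder components; the sampler and belief-consistency scorer here are hypothetical stubs, not DEL-ToM's actual dynamic-epistemic-logic machinery.

```python
# Minimal best-of-N sketch of inference-time scaling for Theory-of-Mind
# reasoning. sample_reasoning_traces and belief_consistency_score are
# hypothetical placeholders, not components from the DEL-ToM paper.
import random
from typing import Callable

def sample_reasoning_traces(prompt: str, n: int) -> list[str]:
    # Stand-in for n stochastic LLM decodes of a belief-reasoning trace.
    return [f"trace-{i} for: {prompt}" for i in range(n)]

def belief_consistency_score(trace: str) -> float:
    # Stand-in for a verifier that checks each belief-update step
    # (DEL-ToM uses dynamic epistemic logic; this stub is random).
    return random.random()

def best_of_n(prompt: str, n: int = 8,
              scorer: Callable[[str], float] = belief_consistency_score) -> str:
    """Spend extra inference-time compute by sampling n candidate
    reasoning traces and returning the highest-scoring one."""
    traces = sample_reasoning_traces(prompt, n)
    return max(traces, key=scorer)

if __name__ == "__main__":
    print(best_of_n("Where does Sally think the marble is?", n=4))
```

The key point is that quality improves with the sampling budget n and the fidelity of the scorer, without retraining the underlying model.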

Sources

Social preferences with unstable interactive reasoning: Large language models in economic trust games

DEL-ToM: Inference-Time Scaling for Theory-of-Mind Reasoning via Dynamic Epistemic Logic

Probe by Gaming: A Game-based Benchmark for Assessing Conceptual Knowledge in LLMs

Towards Dynamic Theory of Mind: Evaluating LLM Adaptation to Temporal Evolution of Human States

Survival Games: Human-LLM Strategic Showdowns under Severe Resource Scarcity

The Staircase of Ethics: Probing LLM Value Priorities through Multi-Step Induction to Complex Moral Dilemmas

GuessArena: Guess Who I Am? A Self-Adaptive Framework for Evaluating LLMs in Domain-Specific Knowledge and Reasoning

ToMAP: Training Opponent-Aware LLM Persuaders with Theory of Mind

LLM Agents for Bargaining with Utility-based Feedback

SocialMaze: A Benchmark for Evaluating Social Reasoning in Large Language Models
