Advancements in Reasoning and Cognitive Intelligence

The field of artificial intelligence is moving toward strengthening the reasoning abilities of AI models by drawing on human-like cognitive intelligence. Recent research has centered on developing large language models and evaluating their performance across benchmarks, underscoring the need for more systematic and comprehensive ways to assess cognitive abilities. A key direction is the development of finer-grained evaluation protocols that diagnose specific strengths and weaknesses of AI models, such as their ability to detect errors and inconsistencies in their own reasoning. Noteworthy papers include:

Foundation of Intelligence: Review of Math Word Problems from Human Cognition Perspective provides a comprehensive review of math word problem solving through the lens of human cognition.

Reasoning Models Reason Well, Until They Don't develops a new dataset for evaluating model performance on graph connectivity and natural-language proof planning, showing that existing benchmarks cover only limited complexity.

PerCoR: Evaluating Commonsense Reasoning in Persian via Multiple-Choice Sentence Completion introduces a novel conjunction-based segmentation strategy for generating coherent sentence-completion pairs.

Far from the Shallow: Brain-Predictive Reasoning Embedding through Residual Disentanglement introduces a residual disentanglement method that computationally isolates components of language processing.

PRISM-Bench: A Benchmark of Puzzle-Based Visual Tasks with CoT Error Detection introduces a diagnostic task that evaluates not only whether models can solve problems but how their reasoning unfolds.

RiddleBench: A New Generative Reasoning Benchmark for LLMs probes core reasoning capabilities that require integrating logical deduction with spatial awareness and constraint satisfaction.

Can LLMs Estimate Cognitive Complexity of Reading Comprehension Items? examines whether large language models can estimate the cognitive complexity of reading comprehension items.

Scales++: Compute Efficient Evaluation Subset Selection with Cognitive Scales Embeddings proposes an item-centric approach to benchmark subset selection based on the intrinsic properties of the task items themselves (a generic sketch of this idea appears below).
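To make the idea of item-centric benchmark subset selection concrete, the sketch below clusters per-item feature embeddings and evaluates only on cluster representatives. It is a minimal illustration under assumed inputs (random stand-in features, a k-means clustering step, and an invented `select_subset` helper), not the actual Scales++ procedure or its cognitive scales embeddings.

```python
# Generic illustration: pick a compute-efficient evaluation subset by
# choosing benchmark items whose feature embeddings are closest to
# cluster centroids. Features here are random placeholders.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Stand-in for per-item features (e.g., estimated reasoning depth or
# linguistic complexity); real features would be derived from the items.
num_items, num_features = 1000, 8
item_embeddings = rng.normal(size=(num_items, num_features))

def select_subset(embeddings: np.ndarray, budget: int) -> np.ndarray:
    """Return indices of `budget` items nearest to k-means centroids."""
    kmeans = KMeans(n_clusters=budget, n_init=10, random_state=0).fit(embeddings)
    chosen = []
    for centroid in kmeans.cluster_centers_:
        distances = np.linalg.norm(embeddings - centroid, axis=1)
        chosen.append(int(np.argmin(distances)))
    return np.unique(chosen)

subset_ids = select_subset(item_embeddings, budget=50)
print(f"Evaluating on {len(subset_ids)} of {num_items} items")
```

The intent of such a selection is that scores on the small representative subset track scores on the full benchmark at a fraction of the evaluation cost.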

Sources

Foundation of Intelligence: Review of Math Word Problems from Human Cognition Perspective

Reasoning Models Reason Well, Until They Don't

PerCoR: Evaluating Commonsense Reasoning in Persian via Multiple-Choice Sentence Completion

Far from the Shallow: Brain-Predictive Reasoning Embedding through Residual Disentanglement

PRISM-Bench: A Benchmark of Puzzle-Based Visual Tasks with CoT Error Detection

RiddleBench: A New Generative Reasoning Benchmark for LLMs

Can LLMs Estimate Cognitive Complexity of Reading Comprehension Items?

Scales++: Compute Efficient Evaluation Subset Selection with Cognitive Scales Embeddings
