Advances in Language Models and Automated Reasoning

The fields of natural language processing and automated reasoning are advancing rapidly, with much recent effort directed at building more capable language models and benchmarks that evaluate them rigorously. Recent work has emphasized pragmatics, sentiment analysis, and reasoning in language models, and several new benchmarks and datasets have been introduced to probe these abilities. Benchmarks such as SloPragEval, QuArch, and AMO-Bench extend the scope of language model evaluation, while advances in vision-language models are enabling automated interpretation of complex documents and images. In parallel, work on automated reasoning and verification has produced new frameworks and tools such as WaveVerif, for verifying robotic workflows, and Lean4Physics, for formal reasoning about college-level physics. Noteworthy papers include SloPragEval, which introduces the first pragmatics understanding benchmarks for Slovene; QuArch, which presents a benchmark for evaluating LLM reasoning in computer architecture; and AMO-Bench, an Olympiad-level mathematical reasoning benchmark on which current language models still leave substantial room for improvement.

Sources

From Polyester Girlfriends to Blind Mice: Creating the First Pragmatics Understanding Benchmarks for Slovene

OCR-Quality: A Human-Annotated Dataset for OCR Quality Assessment

A Multi-Stage Hybrid Framework for Automated Interpretation of Multi-View Engineering Drawings Using Vision Language Model

QuArch: A Benchmark for Evaluating LLM Reasoning in Computer Architecture

SentiMaithili: A Benchmark Dataset for Sentiment and Reason Generation for the Low-Resource Maithili Language

VEHME: A Vision-Language Model For Evaluating Handwritten Mathematics Expressions

Can Language Models Compose Skills In-Context?

Multi-Stage Field Extraction of Financial Documents with OCR and Compact Vision-Language Models

Floating-Point Neural Network Verification at the Software Level

MMTutorBench: The First Multimodal Benchmark for AI Math Tutoring

RLMEval: Evaluating Research-Level Neural Theorem Proving

Are Language Models Efficient Reasoners? A Perspective from Logic Programming

WaveVerif: Acoustic Side-Channel based Verification of Robotic Workflows

Lean4Physics: Comprehensive Reasoning Framework for College-level Physics in Lean4

QCoder Benchmark: Bridging Language Generation and Quantum Hardware through Simulator-Based Feedback

Pragmatic Theories Enhance Understanding of Implied Meanings in LLMs

Cross-Platform Evaluation of Reasoning Capabilities in Foundation Models

AMO-Bench: Large Language Models Still Struggle in High School Math Competitions
