Advancements in Mathematical Reasoning with Large Language Models

Research on mathematical reasoning with large language models (LLMs) is advancing rapidly, with much of the recent effort going into more comprehensive and robust evaluation benchmarks. Recent work stresses the need to move beyond traditional benchmarks toward harder settings such as research-level proof generation and nonstandard problem-solving techniques. Novel training frameworks, such as co-evolutionary loops and generative process supervision, have also shown promise in improving LLM performance on complex mathematical tasks.

Noteworthy papers in this area include the following. StepORLM introduces a self-evolving framework with generative process supervision and achieves state-of-the-art results on six benchmarks. BeyondBench proposes a benchmark-free evaluation framework based on algorithmic problem generation and reveals consistent reasoning deficiencies across model families. IMProofBench introduces a private benchmark of 39 peer-reviewed problems developed by expert mathematicians and uses it to evaluate LLMs on research-level mathematical proof generation. EEFSUVA presents a novel benchmark curated from under-circulated regional and national Olympiads and suggests that broader evaluation datasets may be important for a fuller assessment of mathematical reasoning. SKYLENAGE releases two complementary benchmarks for multi-level math evaluation and analyzes subject × model and grade × model performance. OR-Toolformer fine-tunes Llama-3.1-8B-Instruct using a semi-automatic data synthesis pipeline and augments the model with external solvers by having it produce API calls, achieving up to 80.1% execution accuracy on standard benchmarks.
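To make the idea of evaluation through algorithmic problem generation concrete, the following is a minimal sketch and not BeyondBench's actual method: problems are generated on the fly from a seeded random source, and the ground-truth answer is computed rather than looked up, so no fixed test set exists to leak into training data. The names make_problem, evaluate, and oracle_model are hypothetical and chosen only for illustration, and the arithmetic template is an assumed toy task.

```python
import random
import re


def make_problem(rng):
    """Generate one arithmetic problem and compute its ground-truth answer."""
    a = rng.randint(10, 99)
    b = rng.randint(10, 99)
    c = rng.randint(2, 9)
    question = f"Compute ({a} + {b}) * {c}."
    answer = (a + b) * c  # computed directly, so it is correct by construction
    return question, answer


def evaluate(model, n_problems, seed):
    """Return the fraction of freshly generated problems the model answers correctly."""
    rng = random.Random(seed)  # a fixed seed makes the run reproducible
    correct = 0
    for _ in range(n_problems):
        question, answer = make_problem(rng)
        if model(question) == answer:
            correct += 1
    return correct / n_problems


def oracle_model(question):
    """Hypothetical stand-in for an LLM: parses the question and solves it exactly."""
    a, b, c = (int(token) for token in re.findall(r"\d+", question))
    return (a + b) * c


if __name__ == "__main__":
    print(evaluate(oracle_model, n_problems=100, seed=0))
```

In practice the model argument would wrap an LLM call plus answer extraction, and the generator would cover many problem families at tunable difficulty; changing the seed yields a new problem set at no curation cost.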

