Advances in Mathematical Reasoning for Large Language Models

Research on mathematical reasoning in large language models (LLMs) is advancing rapidly, with work concentrated on logical reasoning, numerical reasoning, and multilingual support. Recent developments highlight three directions: adaptive selection of symbolic languages, joint logical-numerical reasoning, and robust test-time ensemble methods. Researchers are also introducing new benchmarks and datasets, such as MATH-Beyond and MathMist, to evaluate LLMs' mathematical reasoning capabilities. Together, these efforts target the limitations and gaps of existing models.

Noteworthy papers include:

Adaptive Selection of Symbolic Languages for Improving LLM Logical Reasoning proposes a method that improves logical reasoning performance by adaptively selecting the most suitable symbolic language for each problem.

LogiNumSynth: Synthesizing Joint Logical-Numerical Reasoning Problems for Language Models introduces a flexible natural language problem synthesizer that generates tasks requiring joint logical and numerical reasoning.

MATH-Beyond is a benchmark designed to defeat common open-source models and to require methods that learn to reason in ways that go beyond base model capabilities.

Program of Thoughts for Financial Reasoning: Leveraging Dynamic In-Context Examples and Generative Retrieval achieves state-of-the-art performance on financial numerical reasoning datasets using a novel two-step framework.

MathMist is a parallel multilingual benchmark dataset for mathematical problem solving and reasoning. It reveals persistent deficiencies in LLMs' ability to perform consistent and interpretable mathematical reasoning across languages.
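To make the Program of Thoughts idea mentioned above more concrete, the following is a minimal sketch of the general pattern: the model writes a short program rather than a free-text calculation, and the program is executed to obtain the numeric answer. This is not the two-step framework from the paper. The function name `run_program_of_thought`, the `ans` variable convention, and the sample question and program are all hypothetical, chosen only for illustration.

```python
def run_program_of_thought(program: str) -> float:
    """Execute a model-written program and return the value it stores in `ans`.

    Assumption: the program follows a hypothetical convention of assigning
    its final result to a variable named `ans`.
    """
    namespace: dict = {}
    # Emptying the builtins only blocks the most obvious misuse. It is NOT a
    # real sandbox; untrusted model output would need proper isolation.
    exec(program, {"__builtins__": {}}, namespace)
    return float(namespace["ans"])


# Hypothetical model output for the question:
# "Revenue rose from 120 to 150 (in millions). What was the percentage increase?"
generated_program = """
revenue_start = 120.0
revenue_end = 150.0
ans = (revenue_end - revenue_start) / revenue_start * 100
"""

answer = run_program_of_thought(generated_program)
print(f"{answer:.1f}%")  # prints "25.0%"
```

Delegating the arithmetic to an interpreter is the main appeal of this pattern: the model only has to set up the computation correctly, and the exact numeric work is done by the executed code.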

Sources

Adaptive Selection of Symbolic Languages for Improving LLM Logical Reasoning

LogiNumSynth: Synthesizing Joint Logical-Numerical Reasoning Problems for Language Models

MATH-Beyond: A Benchmark for RL to Expand Beyond the Base Model

BanglaMATH: A Bangla benchmark dataset for testing LLM mathematical reasoning at grades 6, 7, and 8

Max It or Miss It: Benchmarking LLM On Solving Extremal Problems

Program of Thoughts for Financial Reasoning: Leveraging Dynamic In-Context Examples and Generative Retrieval

Harnessing Consistency for Robust Test-Time LLM Ensemble

MathMist: A Parallel Multilingual Benchmark Dataset for Mathematical Problem Solving and Reasoning
