The field of mathematical reasoning for large language models (LLMs) is advancing rapidly, with recent work focused on logical reasoning, numerical reasoning, and multilingual support. Recent developments highlight the importance of adaptive selection of symbolic languages, joint logical-numerical reasoning, and robust test-time ensemble methods. Researchers are also introducing new benchmarks and datasets, such as MATH-Beyond and MathMist, to probe LLMs' mathematical reasoning capabilities and expose the limitations of existing models.
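Test-time ensembling is commonly realized as self-consistency-style majority voting over independently sampled solutions. The sketch below illustrates only that generic idea, not the specific ensemble method of any paper discussed here; `sample_solution` is a hypothetical stand-in for an LLM call that returns one candidate final answer.

```python
from collections import Counter

def majority_vote(answers):
    """Pick the most frequent final answer among sampled candidates.

    A minimal self-consistency-style ensemble: sample several reasoning
    paths, keep only their final answers, and return the mode.
    """
    counts = Counter(a.strip() for a in answers if a is not None)
    if not counts:
        return None
    answer, _ = counts.most_common(1)[0]
    return answer

# Hypothetical usage (sample_solution is an assumed LLM wrapper):
# candidates = [sample_solution(problem, temperature=0.8) for _ in range(8)]
# final_answer = majority_vote(candidates)
```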
Noteworthy papers include: Adaptive Selection of Symbolic Languages for Improving LLM Logical Reasoning, which improves logical reasoning performance by adaptively choosing the symbolic language best suited to each problem; LogiNumSynth: Synthesizing Joint Logical-Numerical Reasoning Problems for Language Models, which introduces a flexible natural-language problem synthesizer that generates tasks requiring joint logical and numerical reasoning; MATH-Beyond, a benchmark built from problems that common open-source models fail to solve, intended to drive methods that reason beyond base model capabilities; Program of Thoughts for Financial Reasoning: Leveraging Dynamic In-Context Examples and Generative Retrieval, which achieves state-of-the-art performance on financial numerical reasoning datasets with a two-step framework combining dynamic in-context example selection and generative retrieval; and MathMist, a parallel multilingual benchmark for mathematical problem solving and reasoning that reveals persistent deficiencies in LLMs' ability to reason consistently and interpretably across languages.
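Program-of-Thoughts prompting, which the financial reasoning paper builds on, has the model emit executable code rather than free-text arithmetic, and the answer is obtained by running that code. The sketch below is a minimal illustration of this general pattern; the prompt template, the `generate_code` callable, and the `answer` variable convention are illustrative assumptions, not the paper's exact two-step framework with dynamic in-context examples and generative retrieval.

```python
import contextlib
import io

# Assumed prompt template: instruct the model to write code that
# leaves its result in a variable named `answer`.
POT_PROMPT = """You are a financial reasoning assistant.
Write Python code that computes the answer to the question and
stores it in a variable named `answer`.

Question: {question}
Context: {context}

# Python code:
"""

def solve_with_program_of_thoughts(question, context, generate_code):
    """Program-of-Thoughts style solving: ask the model for a program,
    execute it, and read the numeric result from `answer`.

    `generate_code` is a hypothetical callable wrapping an LLM that
    returns generated Python source as a string.
    """
    code = generate_code(POT_PROMPT.format(question=question, context=context))
    namespace = {}
    # Run the generated program in an isolated namespace; a real system
    # would sandbox this execution step.
    with contextlib.redirect_stdout(io.StringIO()):
        exec(code, namespace)
    return namespace.get("answer")
```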