Efficient Reasoning in Large Language Models

Large language models are moving toward more efficient and effective reasoning. Recent studies show that scaling reasoning can improve factuality, and that adaptive token allocation and token-efficient inference scaling yield substantial performance gains. Cost-aware planning and budget allocation likewise improve reasoning accuracy and efficiency; speculative parallel scaling and length-based reward shaping have achieved strong gains on mathematical reasoning benchmarks; and frameworks such as Plan-and-Budget and Cost-Augmented Monte Carlo Tree Search improve reasoning efficiency across a range of tasks and models.

Two papers are particularly noteworthy. SelfBudgeter proposes a self-adaptive, controllable strategy for efficient reasoning, achieving up to 74.47% response-length compression on the MATH benchmark while keeping accuracy nearly undiminished. A*-Decoding introduces a search-based inference-time strategy that builds on the A* search algorithm to make the most of a fixed compute budget, matching strong inference-scaling baselines while using up to 3x fewer tokens and 30% fewer PRM passes under equivalent compute budgets. Minimal sketches of both ideas follow.
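
To make the budget-then-generate pattern behind SelfBudgeter-style adaptive token allocation concrete, here is a minimal Python sketch. The estimate_budget heuristic, the prompt wording, and the generate interface are illustrative assumptions; the paper itself trains the model to predict its own token budget before answering, rather than relying on a hand-written heuristic.

    def estimate_budget(problem: str, base: int = 128,
                        per_char: float = 0.5, cap: int = 2048) -> int:
        # Hypothetical difficulty proxy: longer problems get more tokens,
        # up to a hard cap. A stand-in for a learned budget predictor.
        return min(cap, base + int(per_char * len(problem)))

    def solve_with_budget(generate, problem: str) -> str:
        # First fix a token budget, then cap generation at that budget.
        # `generate(prompt, max_tokens)` is any text-generation callable.
        budget = estimate_budget(problem)
        prompt = (
            f"Solve the problem. Keep your reasoning under {budget} tokens.\n\n"
            f"Problem: {problem}\nAnswer:"
        )
        return generate(prompt, max_tokens=budget)

Capping max_tokens is what actually bounds cost; the in-prompt instruction merely encourages the model to plan for the budget instead of being truncated mid-thought.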
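
In the same spirit, here is a minimal sketch of A*-style best-first decoding under a fixed token budget, where a process reward model (PRM) scores partial solutions and serves as the search heuristic. The expand, prm_score, and is_final callables and the whitespace token accounting are assumptions made for illustration, not the paper's actual interfaces.

    import heapq
    from itertools import count

    def a_star_decode(prompt, expand, prm_score, is_final, token_budget):
        # Best-first search over partial generations: always extend the
        # partial solution the PRM currently scores highest.
        tie = count()  # tie-breaker so the heap never compares states
        frontier = [(-prm_score(prompt), next(tie), prompt)]
        tokens_used = 0
        while frontier and tokens_used < token_budget:
            _, _, state = heapq.heappop(frontier)
            if is_final(state):
                return state
            for step in expand(state):  # candidate next reasoning steps
                tokens_used += len(step.split())  # crude token accounting
                child = state + step
                heapq.heappush(frontier, (-prm_score(child), next(tie), child))
        return None  # budget exhausted before reaching a final answer

    # Toy usage: grow the string "aaa" one character at a time.
    result = a_star_decode(
        prompt="",
        expand=lambda s: ["a", "b"],
        prm_score=lambda s: s.count("a") / 3,
        is_final=lambda s: s == "aaa",
        token_budget=50,
    )
    assert result == "aaa"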

Sources

Scaling Reasoning can Improve Factuality in Large Language Models

SelfBudgeter: Adaptive Token Allocation for Efficient LLM Reasoning

A*-Decoding: Token-Efficient Inference Scaling

Cost-Augmented Monte Carlo Tree Search for LLM-Assisted Planning

The Energy Cost of Reasoning: Analyzing Energy Usage in LLMs with Test-time Compute

Incorporating Token Usage into Prompting Strategy Evaluation

SSR: Speculative Parallel Scaling Reasoning in Test-time

Learn to Reason Efficiently with Adaptive Length-based Reward Shaping

Plan and Budget: Effective and Efficient Test-Time Scaling on Large Language Model Reasoning

How do Scaling Laws Apply to Knowledge Graph Engineering Tasks? The Impact of Model Size on Large Language Model Performance
