Advancements in Large Language Model Reasoning

The field of large language model (LLM) reasoning is advancing rapidly, with a focus on improving the efficiency and effectiveness of reinforcement learning (RL) techniques. Recent studies have examined the interplay between supervised fine-tuning (SFT) and RL, highlighting the role of backtracking in strengthening LLM reasoning. Researchers have also proposed methods for code-integrated reasoning, selective rollouts, and angle-informed navigation, which show promising gains in training efficiency and model performance. Noteworthy papers include 'How Much Backtracking is Enough? Exploring the Interplay of SFT and RL in Enhancing LLM Reasoning', which investigates how SFT and RL interact across a range of reasoning tasks, and 'Angles Don't Lie: Unlocking Training-Efficient RL Through the Model's Own Signals', which proposes a gradient-driven, angle-informed navigated RL framework for improving training efficiency. Overall, the field is moving toward more efficient RL and SFT techniques for LLM reasoning, with several complementary lines of work on data selection, reward modeling, and curriculum design.
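To make the selective-rollout idea concrete, the sketch below shows one plausible filtering criterion under GRPO-style group-normalized rewards: prompts whose sampled rollouts all receive identical rewards produce zero advantages and hence no gradient signal, so they can be skipped. This is an illustrative assumption, not the specific method of 'Act Only When It Pays'; the function names and the variance-based criterion are hypothetical.

```python
import numpy as np

def group_advantages(rewards):
    """GRPO-style group-normalized advantages for one prompt's rollouts."""
    rewards = np.asarray(rewards, dtype=float)
    std = rewards.std()
    if std < 1e-8:  # identical rewards across the group -> zero learning signal
        return np.zeros_like(rewards)
    return (rewards - rewards.mean()) / std

def select_prompts_for_rollout(prompt_reward_history, min_variance=1e-8):
    """Illustrative selective-rollout filter: keep prompts whose recent rollout
    rewards still show variance (i.e., they would yield nonzero advantages).
    Prompts with no history are kept so they get explored at least once."""
    selected = []
    for prompt, past_rewards in prompt_reward_history.items():
        if len(past_rewards) == 0 or np.var(past_rewards) > min_variance:
            selected.append(prompt)
    return selected
```

A filter like this saves rollout compute by not re-sampling prompts the policy already solves (or fails) uniformly, at the cost of occasionally re-checking them as the policy changes.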

Sources

How Much Backtracking is Enough? Exploring the Interplay of SFT and RL in Enhancing LLM Reasoning

Towards Effective Code-Integrated Reasoning

Act Only When It Pays: Efficient Reinforcement Learning for LLM Reasoning via Selective Rollouts

Angles Don't Lie: Unlocking Training-Efficient RL Through the Model's Own Signals

Response-Level Rewards Are All You Need for Online Reinforcement Learning in LLMs: A Mathematical Perspective

FreePRM: Training Process Reward Models Without Ground Truth Process Labels

Progressive Mastery: Customized Curriculum Learning with Guided Prompting for Mathematical Reasoning

On the Mechanism of Reasoning Pattern Selection in Reinforcement Learning for Language Models

Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning

Multi-Layer GRPO: Enhancing Reasoning and Self-Correction in Large Language Models

LogicPuzzleRL: Cultivating Robust Mathematical Reasoning in LLMs via Reinforcement Learning

TreeRPO: Tree Relative Policy Optimization

Improving Data Efficiency for LLM Reinforcement Fine-tuning Through Difficulty-targeted Online Data Selection and Rollout Replay
