Research on large language model reasoning is shifting toward methods that align models with human-annotated demonstrations more effectively and efficiently while strengthening their reasoning capabilities. Recent work targets known limitations of supervised fine-tuning and of reinforcement learning with verifiable rewards, including overfitting, poor out-of-domain generalization, and capability regression. Proposed remedies include on-policy techniques, self-rewarding mechanisms, and progressive reward structures aimed at improving generalization and data efficiency, alongside growing interest in reasoning models that are robust to flawed-positive rollouts and reward hacking. Notable papers include Self-Rewarding PPO, which combines supervised fine-tuning with proximal policy optimization to align more effectively from demonstration data, and FAPO, which proposes a parameter-free reward penalty for flawed-positive rollouts, letting the policy exploit them as useful shortcuts during the warm-up stage while optimization gradually shifts toward reliable reasoning in the later refinement stage. Illustrative sketches of both ideas follow.
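
The combination described for Self-Rewarding PPO can be pictured as a single objective that mixes a supervised term on the demonstrations with a PPO-style surrogate driven by a reward the policy produces for itself. The sketch below is illustrative only: the toy policy, the use of the demonstration's own likelihood as the self-reward, and the mixing weight are assumptions, not the paper's formulation.

```python
# Minimal sketch (not the paper's implementation) of a combined SFT + PPO-style
# objective, assuming the "self-reward" is derived from the policy's own
# likelihood of the demonstration. The toy policy and constants are illustrative.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, HIDDEN, SEQ_LEN = 32, 16, 8

# Toy "policy": maps a per-step hidden state to next-token logits.
policy = torch.nn.Linear(HIDDEN, VOCAB)
old_policy = torch.nn.Linear(HIDDEN, VOCAB)
old_policy.load_state_dict(policy.state_dict())  # frozen behavior policy for the PPO ratio

def token_logprobs(model, states, actions):
    """Log-probability of each demonstration token under the given model."""
    logits = model(states)  # (T, VOCAB)
    return F.log_softmax(logits, dim=-1).gather(1, actions[:, None]).squeeze(1)

# A fake demonstration: per-step states plus the annotated next tokens.
states = torch.randn(SEQ_LEN, HIDDEN)
actions = torch.randint(0, VOCAB, (SEQ_LEN,))

logp = token_logprobs(policy, states, actions)
logp_old = token_logprobs(old_policy, states, actions).detach()

# (1) SFT term: negative log-likelihood of the demonstration tokens.
sft_loss = -logp.mean()

# (2) PPO term with a self-derived reward: the policy's average likelihood of
#     the demonstration acts as a scalar reward (an assumption standing in for
#     the paper's reward definition), crudely centered in place of a baseline.
self_reward = logp.mean().exp().detach()  # in (0, 1]: higher when the policy favors the demo
advantage = self_reward - 0.5
ratio = torch.exp(logp - logp_old)
clipped = torch.clamp(ratio, 0.8, 1.2)
ppo_loss = -torch.min(ratio * advantage, clipped * advantage).mean()

loss = sft_loss + 0.5 * ppo_loss  # illustrative mixing weight
loss.backward()
print(f"sft={sft_loss.item():.3f}  ppo={ppo_loss.item():.3f}  total={loss.item():.3f}")
```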
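
To make the FAPO description concrete, the toy below shapes rewards so that flawed-positive rollouts (correct final answer, flawed reasoning) earn full credit early in training and are progressively penalized later. It is a sketch of the general idea only: the `Rollout` fields, the availability of a flaw detector, and the linear penalty schedule are assumptions, and the scheduled penalty here is explicitly not the paper's parameter-free formulation.

```python
# Illustrative flaw-aware reward shaping in the spirit of the FAPO description.
# The flaw detector, the linear schedule, and all constants are assumptions;
# the actual method is described as using a parameter-free penalty.
from dataclasses import dataclass

@dataclass
class Rollout:
    is_correct: bool  # final answer verified against the ground truth
    is_flawed: bool   # reasoning flagged as flawed by some detector (assumed available)

def shaped_reward(rollout: Rollout, step: int, total_steps: int) -> float:
    """Return the training reward for one rollout at a given training step."""
    if not rollout.is_correct:
        return 0.0  # incorrect rollouts receive no reward
    if not rollout.is_flawed:
        return 1.0  # clean correct rollouts keep the full reward
    # Flawed-positive rollout: full credit early (warm-up shortcut),
    # progressively penalized as training moves into the refinement stage.
    progress = min(step / max(total_steps, 1), 1.0)
    return 1.0 - progress  # decays from 1.0 to 0.0 over training

if __name__ == "__main__":
    demo = Rollout(is_correct=True, is_flawed=True)
    for step in (0, 500, 1000):
        print(step, shaped_reward(demo, step, total_steps=1000))
```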