Reinforcement Learning for Large Language Models

Reinforcement learning for large language models is increasingly focused on strengthening exploration and reasoning. Recent work targets limitations of current methods, notably the collapse of exploration as training sharpens the policy and the dependence on capabilities already present in the base model. Proposed remedies include new regularization techniques, curriculum learning, and data-centric interventions to get past the zero-reward barrier. Noteworthy papers include Low-probability Regularization, which sustains stable on-policy training and achieves state-of-the-art performance (a minimal sketch of the general idea appears below), and Unlocking Reasoning Capabilities, which proposes RAPO to promote broader yet focused exploration. Other notable works, Slow-Fast Policy Optimization, Selective Expert Guidance, and XRPO, introduce efficient frameworks and mechanisms to balance exploration and exploitation.
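
For orientation, the sketch below shows one plausible way a low-probability regularizer could be attached to a GRPO-style, token-level policy-gradient loss to keep exploration alive. It is an illustrative assumption, not the formulation from any paper listed under Sources; the function name, threshold, and weight are hypothetical.

```python
# Illustrative sketch only: a REINFORCE/GRPO-style token-level loss with an
# added bonus on low-probability sampled tokens, meant to keep gradient
# signal on rare tokens so exploration does not collapse. The threshold and
# weight below are assumed values, not taken from the cited papers.
import torch


def pg_loss_with_low_prob_bonus(
    logprobs: torch.Tensor,    # (batch, seq) log-probs of the sampled tokens
    advantages: torch.Tensor,  # (batch, seq) group-normalized advantages
    mask: torch.Tensor,        # (batch, seq) 1 for response tokens, 0 for padding
    low_prob_threshold: float = 0.1,  # hypothetical cutoff for "low-probability"
    reg_weight: float = 0.01,         # hypothetical regularization strength
) -> torch.Tensor:
    mask = mask.float()
    probs = logprobs.exp()

    # Standard policy-gradient term: maximize advantage-weighted log-probability.
    pg_term = -(advantages.detach() * logprobs)

    # Regularizer: extra positive weight on tokens sampled with low probability,
    # nudging the policy to keep probability mass on them rather than pruning
    # them away as the distribution sharpens during training.
    low_prob = (probs < low_prob_threshold).float()
    reg_term = -reg_weight * low_prob * logprobs

    token_loss = (pg_term + reg_term) * mask
    return token_loss.sum() / mask.sum().clamp(min=1.0)
```

In practice such a term would be combined with the clipping and group-normalized advantages used by GRPO-style trainers; the sketch isolates only the regularization idea.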

Sources

Low-probability Tokens Sustain Exploration in Reinforcement Learning with Verifiable Reward

Studying the Korean Word-Chain Game with RLVR: Mitigating Reward Conflicts via Curriculum Learning

What Can You Do When You Have Zero Rewards During RL?

Unlocking Reasoning Capabilities in LLMs via Reinforcement Learning Exploration

The Debate on RLVR Reasoning Capability Boundary: Shrinkage, Expansion, or Both? A Two-Stage Dynamic View

Slow-Fast Policy Optimization: Reposition-Before-Update for LLM Reasoning

Selective Expert Guidance for Effective and Diverse Exploration in Reinforcement Learning of LLMs

XRPO: Pushing the limits of GRPO with Targeted Exploration and Exploitation
