Advancements in Reinforcement Learning for Large Language Models

The field of reinforcement learning for large language models is evolving rapidly, with a focus on improving reasoning capabilities and mitigating issues such as entropy collapse. Recent developments center on designing novel reward signals, exploration strategies, and regularization techniques to enhance model performance. Notably, researchers have been exploring flow rewards, uncertainty-aware advantage shaping, and adaptive entropy regularization to promote more efficient and effective learning. There is also growing interest in understanding the role of belief tracking and deviation in active reasoning, as well as the potential of representation-based exploration and last-token self-rewarding. These advances have shown promising results in improving reasoning accuracy, exploration capability, and test-time sample efficiency.
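To make the adaptive entropy regularization idea concrete, here is a minimal, hypothetical sketch of an entropy-bonus coefficient that is nudged toward a target entropy during RL fine-tuning. The function names, the proportional update rule, and the hyperparameters are illustrative assumptions and not the AER algorithm itself.

```python
import torch


def adaptive_entropy_coef(coef, current_entropy, target_entropy,
                          step_size=0.01, min_coef=1e-4, max_coef=1.0):
    """Proportional update of the entropy-bonus coefficient: raise it when
    policy entropy falls below the target (to push exploration back up),
    lower it when entropy overshoots. A generic controller, not the exact
    AER update rule."""
    new_coef = coef + step_size * (target_entropy - current_entropy)
    return float(min(max_coef, max(min_coef, new_coef)))


def policy_loss_with_entropy_bonus(logprobs, advantages, entropy, coef):
    """Standard policy-gradient surrogate plus an entropy bonus weighted by
    the adaptively updated coefficient."""
    pg_loss = -(advantages.detach() * logprobs).mean()
    return pg_loss - coef * entropy


# Example usage with dummy tensors (hypothetical shapes).
logprobs = torch.randn(8, requires_grad=True)
advantages = torch.randn(8)
entropy = torch.tensor(1.2)
coef = adaptive_entropy_coef(coef=0.01, current_entropy=entropy.item(),
                             target_entropy=2.0)
loss = policy_loss_with_entropy_bonus(logprobs, advantages, entropy, coef)
```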

Noteworthy papers include RLFR, which proposes a novel perspective on shaping RLVR with flow rewards derived from the latent space; Unlocking Exploration in RLVR, which introduces UnCertainty-aware Advantage Shaping (UCAS), a model-free method that refines credit assignment using the model's internal uncertainty signals; Rediscovering Entropy Regularization, which proposes Adaptive Entropy Regularization (AER), a framework that dynamically balances exploration and exploitation through three components; and LaSeR, which augments the original RLVR loss with an MSE loss that aligns last-token self-rewarding scores with verifier-based reasoning rewards.
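As a rough illustration of the LaSeR-style auxiliary objective described above, the sketch below combines an ordinary RLVR policy-gradient loss with an MSE term that pulls each sequence's last-token self-reward score toward its verifier reward. The function signature, how the self-reward scores are computed, and the weighting are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def rlvr_loss_with_last_token_self_reward(seq_logprobs, advantages,
                                          self_reward_scores, verifier_rewards,
                                          mse_weight=1.0):
    """Hypothetical combined objective in the spirit of the LaSeR summary.

    seq_logprobs:        (batch,) summed log-probs of sampled responses
    advantages:          (batch,) verifier-derived advantages
    self_reward_scores:  (batch,) scalar scores read off the last token
                         (how these are computed is defined in the paper)
    verifier_rewards:    (batch,) rewards from the rule-based verifier
    """
    pg_loss = -(advantages.detach() * seq_logprobs).mean()
    self_reward_loss = F.mse_loss(self_reward_scores, verifier_rewards.detach())
    return pg_loss + mse_weight * self_reward_loss
```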

Sources

Rethinking Entropy Interventions in RLVR: An Entropy Change Perspective

RLFR: Extending Reinforcement Learning for LLMs with Flow Environment

Unlocking Exploration in RLVR: Uncertainty-aware Advantage Shaping for Deeper Reasoning

Rediscovering Entropy Regularization: Adaptive Coefficient Unlocks Its Potential for LLM Reinforcement Learning

How Reinforcement Learning After Next-Token Prediction Facilitates Learning

Representation-Based Exploration for Language Models: From Test-Time to Post-Training

$\mathbf{T^3}$: Reducing Belief Deviation in Reinforcement Learning for Active Reasoning

What is the objective of reasoning with reinforcement learning?

Beyond Multi-Token Prediction: Pretraining LLMs with Future Summaries

SimKO: Simple Pass@K Policy Optimization

Reasoning with Sampling: Your Base Model is Smarter Than You Think

LaSeR: Reinforcement Learning with Last-Token Self-Rewarding
