The field of reinforcement learning for large language models is evolving rapidly, with a focus on improving reasoning capabilities and mitigating issues such as entropy collapse. Recent developments have centered on designing novel reward signals, exploration strategies, and regularization techniques to enhance model performance. Notably, researchers have been exploring flow rewards, uncertainty-aware advantage shaping, and adaptive entropy regularization to promote more efficient and effective learning. Additionally, there has been growing interest in understanding the role of belief tracking and deviation in active reasoning, as well as the potential of representation-based exploration and last-token self-rewarding. These advances have shown promising results in improving reasoning accuracy, exploration capability, and test-time sample efficiency.
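To make the entropy-collapse mitigation concrete, below is a minimal, illustrative sketch of entropy regularization in a token-level policy-gradient loss, with a simple adaptive rule that nudges the entropy coefficient toward a target entropy. The function names, the coefficient-update rule, and the target-entropy scheme are assumptions for illustration, not the specific AER formulation from the paper.

```python
import torch

def entropy_regularized_pg_loss(logits, actions, advantages, coef):
    """Token-level policy-gradient loss with an entropy bonus.

    logits:     (batch, seq, vocab) policy logits over sampled sequences
    actions:    (batch, seq) sampled token ids
    advantages: (batch, seq) per-token advantage estimates
    coef:       scalar entropy coefficient
    """
    log_probs = torch.log_softmax(logits, dim=-1)
    act_logp = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    pg_loss = -(advantages.detach() * act_logp).mean()

    # Mean per-token entropy of the policy distribution.
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(-1).mean()

    # Subtracting the entropy term rewards higher entropy,
    # pushing back against entropy collapse.
    return pg_loss - coef * entropy


def update_entropy_coef(coef, current_entropy, target_entropy, lr=0.01):
    """Illustrative adaptive rule (an assumption, not AER itself):
    raise the coefficient when entropy falls below a target,
    lower it when entropy overshoots."""
    return max(0.0, coef + lr * (target_entropy - current_entropy))
```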
Noteworthy papers include RLFR, which proposes a novel perspective on shaping RLVR with flow rewards derived from the latent space; Unlocking Exploration in RLVR, which introduces UnCertainty-aware Advantage Shaping (UCAS), a model-free method that refines credit assignment by leveraging the model's internal uncertainty signals; Rediscovering Entropy Regularization, which proposes Adaptive Entropy Regularization (AER), a framework that dynamically balances exploration and exploitation via three components; and LaSeR, which simply augments the original RLVR loss with an MSE loss that aligns last-token self-rewarding scores with verifier-based reasoning rewards, as sketched below.
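The following sketch illustrates the LaSeR-style augmentation described above: an RLVR policy loss plus an MSE term aligning a last-token self-rewarding score with the verifier reward. How the self-reward is read out (here, the probability of a designated token at the last position) and the `mse_weight` parameter are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def laser_style_loss(rlvr_loss, last_token_logits, verifier_reward,
                     reward_token_id, mse_weight=1.0):
    """Augment an RLVR objective with an MSE term that aligns a
    last-token self-rewarding score with the verifier-based reward.

    rlvr_loss:         scalar RLVR policy loss, computed elsewhere
    last_token_logits: (batch, vocab) logits at the final position
    verifier_reward:   (batch,) verifier-based reasoning rewards in [0, 1]
    reward_token_id:   vocabulary id whose probability is read out as the
                       model's self-assigned reward (an assumption here)
    """
    # Self-reward: probability mass the model places on the designated
    # token at the last position of the generated sequence.
    self_reward = torch.softmax(last_token_logits, dim=-1)[:, reward_token_id]
    mse = F.mse_loss(self_reward, verifier_reward)
    return rlvr_loss + mse_weight * mse
```

Because the extra term only compares two scalars per sequence, this kind of augmentation adds negligible compute on top of the base RLVR update while giving the model a built-in score it can reuse at test time.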