Advancements in Reinforcement Learning for Large Language Models

The field of large language models (LLMs) is advancing rapidly, with a strong focus on improving reasoning capabilities through reinforcement learning with verifiable rewards (RLVR). Recent work emphasizes making fuller use of the model's output distribution, exploring the token space more broadly, and making policy updates more sample-efficient. Researchers are also addressing limitations of current RLVR approaches, for example by incorporating mixture-of-token generation and by extracting learning signals from zero-variance prompts, i.e., prompts whose sampled responses all receive the same verifiable reward and therefore yield no group-relative advantage. Together, these directions have produced notable gains in reasoning performance and training efficiency.
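To make the zero-variance issue concrete, the following minimal sketch assumes a GRPO-style setup in which advantages are computed by normalizing verifiable rewards within a group of rollouts for the same prompt. The function name and structure are illustrative, not taken from any of the cited papers; the point is only to show why a prompt whose rollouts are all correct or all wrong contributes no policy-gradient signal under this standard formulation.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Group-normalized advantages for one prompt's rollouts (GRPO-style sketch).

    rewards: verifiable 0/1 scores for each sampled response to the same prompt.
    Returns one scalar advantage per response.
    """
    r = np.asarray(rewards, dtype=np.float64)
    std = r.std()
    if std < eps:
        # Zero-variance prompt: every rollout received the same reward
        # (all correct or all wrong), so the normalized advantage is zero
        # and the prompt provides no learning signal.
        return np.zeros_like(r)
    return (r - r.mean()) / std

# A prompt where 2 of 4 sampled answers are verified correct yields a signal:
print(group_relative_advantages([1, 0, 1, 0]))  # -> [ 1. -1.  1. -1.]

# A zero-variance prompt (all four answers correct) yields none:
print(group_relative_advantages([1, 1, 1, 1]))  # -> [0. 0. 0. 0.]
```

Methods that exploit zero-variance prompts aim to recover a useful signal from exactly the second case, rather than discarding those prompts during training.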

Noteworthy papers include:

Learning to Reason with Mixture of Tokens presents a unified framework for mixture-of-token generation in RLVR and reports substantial gains in reasoning performance.

No Prompt Left Behind introduces an entropy-guided advantage-shaping algorithm that extracts learning signals from zero-variance prompts, improving accuracy and pass rates.

ExGRPO proposes a framework that organizes and prioritizes valuable experiences and employs a mixed-policy objective to balance exploration with experience exploitation (a rough sketch of such an objective follows below), consistently improving reasoning performance on mathematical and general benchmarks.
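ExGRPO's exact objective is not reproduced in this digest. As a rough illustration of what a mixed-policy objective can look like, the sketch below combines a clipped importance-weighted policy loss over fresh on-policy rollouts with a down-weighted term over replayed past experiences. The function name, the replay_weight parameter, and the specific weighting scheme are assumptions made here for illustration, not the paper's published formulation.

```python
import torch

def mixed_policy_loss(new_logp, old_logp, advantages, is_replay,
                      clip_eps=0.2, replay_weight=0.5):
    """Clipped policy-gradient loss over a batch mixing on-policy rollouts
    with replayed experiences (illustrative sketch only).

    new_logp, old_logp: per-sample log-probs under the current policy and
        the behavior policy that generated each sample.
    advantages: per-sample advantages (e.g. group-normalized rewards).
    is_replay: boolean mask marking replayed (off-policy) samples.
    """
    ratio = torch.exp(new_logp - old_logp)            # importance weight
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    per_sample = -torch.min(ratio * advantages, clipped * advantages)

    on_policy = per_sample[~is_replay].mean() if (~is_replay).any() else 0.0
    replay = per_sample[is_replay].mean() if is_replay.any() else 0.0
    # Replayed experience is down-weighted so that exploitation of past
    # rollouts does not overwhelm fresh exploration.
    return on_policy + replay_weight * replay
```

The clipping on the importance ratio is the usual guard against large policy updates when reusing samples generated by an older policy; the replay term simply reuses the same machinery on prioritized past rollouts.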

Sources

Learning to Reason with Mixture of Tokens

No Prompt Left Behind: Exploiting Zero-Variance Prompts in LLM Reinforcement Learning via Entropy-Guided Advantage Shaping

RL in the Wild: Characterizing RLVR Training in LLM Deployment

Improving Sampling Efficiency in RLVR through Adaptive Rollout and Response Reuse

LSPO: Length-aware Dynamic Sampling for Policy Optimization in LLM Reasoning

ExGRPO: Learning to Reason from Experience
