The field of large language models (LLMs) is advancing rapidly, with a strong focus on improving reasoning capabilities through reinforcement learning with verifiable rewards (RLVR). Recent work highlights the importance of using distributional information efficiently, exploring the token space, and optimizing policy updates. Notably, researchers are investigating methods that address limitations of current RLVR approaches, such as incorporating mixture-of-token generation and extracting learning signal from zero-variance prompts. These innovations have led to significant improvements in reasoning performance and training efficiency.
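To make the RLVR setup concrete, here is a minimal sketch assuming a GRPO-style recipe with binary exact-match verification; the helper names (verify_answer, group_advantages) and the normalization details are illustrative assumptions, not taken from the papers summarized here.

```python
# Minimal RLVR sketch: verifiable reward plus group-relative advantages.
# Assumes a GRPO-style setup with binary exact-match verification; the
# function names and normalization are illustrative, not from any paper.
from statistics import mean, pstdev

def verify_answer(response: str, ground_truth: str) -> float:
    """Verifiable reward: 1.0 if the response matches the reference answer, else 0.0."""
    return 1.0 if response.strip() == ground_truth.strip() else 0.0

def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within a group of rollouts sampled for the same prompt."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# One prompt, several sampled rollouts scored by the verifier.
rollouts = ["42", "41", "42", "7"]
rewards = [verify_answer(r, "42") for r in rollouts]   # [1.0, 0.0, 1.0, 0.0]
print(group_advantages(rewards))  # positive for correct rollouts, negative for incorrect
```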
Noteworthy papers include Learning to Reason with Mixture of Tokens, which presents a unified framework for mixture-of-token generation in RLVR and achieves substantial improvements in reasoning performance; No Prompt Left Behind, which introduces an algorithm that extracts learning signals from zero-variance prompts, i.e., prompts whose sampled rollouts all receive the same reward and thus yield no group-relative advantage, delivering significant gains in accuracy and pass rate; and ExGRPO, which organizes and prioritizes valuable past rollouts as experience and employs a mixed-policy objective to balance exploration with experience exploitation, consistently improving reasoning performance on mathematical and general benchmarks.
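The zero-variance case can be seen directly under the same group-normalized assumption: when every rollout for a prompt receives an identical reward, the normalized advantages collapse to (near) zero and the prompt contributes no policy-gradient signal. The sketch below reproduces only this limitation, which is the gap No Prompt Left Behind targets, not the paper's remedy.

```python
# Zero-variance prompts under group-normalized advantages (GRPO-style
# assumption): identical rewards within a group give vanishing advantages,
# so the prompt provides no learning signal. Illustrative only; this does
# not implement the method proposed in No Prompt Left Behind.
from statistics import mean, pstdev

def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

all_correct = [1.0, 1.0, 1.0, 1.0]   # prompt solved by every rollout
all_wrong   = [0.0, 0.0, 0.0, 0.0]   # prompt solved by no rollout
print(group_advantages(all_correct))  # [0.0, 0.0, 0.0, 0.0] -> no signal
print(group_advantages(all_wrong))    # [0.0, 0.0, 0.0, 0.0] -> no signal
```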