Advances in Reinforcement Learning for Large Language Models

The field of large language models (LLMs) is advancing rapidly, with a strong focus on improving reasoning capabilities through reinforcement learning (RL). Recent work highlights the importance of balancing exploration and exploitation in RL, with approaches such as entropy-based mechanisms and adaptive guidance. These methods deliver significant performance gains on benchmarks spanning mathematical reasoning and code generation. Notably, dynamic weighting and controllable harmonization of on- and off-policy reinforcement learning have yielded stable and efficient training. Furthermore, evolutionary testing and automated benchmark generation enable the creation of more challenging and diverse evaluation instances, pushing the boundaries of LLMs' reasoning capabilities.
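Several of these ideas reduce to a weighted combination of an on-policy RL objective with an off-policy supervised objective, where the weight shifts over the course of training. The sketch below is a minimal illustration of that general idea in a PyTorch setting; the cosine schedule, function names, and tensor shapes are illustrative assumptions, not the formulation of any cited paper.

```python
import math
import torch
import torch.nn.functional as F

def rl_weight(step: int, total_steps: int) -> float:
    # Anneal from mostly off-policy SFT toward mostly on-policy RL.
    # The cosine schedule is a placeholder, not the cited paper's rule.
    progress = step / max(total_steps, 1)
    return 0.5 * (1.0 - math.cos(math.pi * progress))  # goes 0 -> 1

def combined_loss(sft_logits: torch.Tensor,
                  expert_tokens: torch.Tensor,
                  rl_loss: torch.Tensor,
                  step: int,
                  total_steps: int) -> torch.Tensor:
    # Cross-entropy on expert (off-policy) demonstrations.
    sft_loss = F.cross_entropy(
        sft_logits.reshape(-1, sft_logits.size(-1)),  # (batch*seq, vocab)
        expert_tokens.reshape(-1),                     # (batch*seq,)
    )
    w = rl_weight(step, total_steps)  # dynamic weight on the on-policy term
    return (1.0 - w) * sft_loss + w * rl_loss
```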

Some noteworthy papers in this area include:

CURE introduces a two-stage framework for balancing exploration and exploitation, achieving state-of-the-art performance in both entropy and accuracy.

ETTRL proposes an entropy-based mechanism to improve the exploration-exploitation balance in test-time reinforcement learning, yielding a 68% relative improvement in Pass@1 on the AIME 2024 benchmark.

EvolMathEval presents an automated framework for generating and evolving mathematical benchmarks, producing a large volume of high-difficulty problems and increasing the complexity of public datasets.

G$^2$RPO-A investigates guided group relative policy optimization with adaptive guidance, substantially outperforming vanilla GRPO on mathematical reasoning and code-generation benchmarks.

Depth-Breadth Synergy in RLVR dissects the popular GRPO algorithm and introduces Difficulty Adaptive Rollout Sampling (DARS), delivering consistent Pass@k gains without extra inference cost at convergence.

Beyond Pass@1 proposes an online Self-play with Variational problem Synthesis (SvS) strategy for RLVR training, substantially improving Pass@k performance on competition-level benchmarks.

Hard Examples Are All You Need investigates the effect of example difficulty on GRPO fine-tuning, finding that training on the hardest examples yields the largest performance gains.

Your Reward Function for RL is Your Best PRM for Search introduces a unified approach combining RL-based and search-based test-time scaling, improving performance by 9% on average over the base model.

PuzzleClone presents a formal framework for synthesizing verifiable data at scale using Satisfiability Modulo Theories (SMT), generating a curated benchmark of over 83K diverse, programmatically validated puzzles.
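Two quantities recur throughout these papers: GRPO's group-relative advantage, which scores each sampled response against the mean and standard deviation of rewards within its group rather than relying on a learned critic, and the Pass@k metric, typically computed with the standard unbiased estimator. The sketch below shows only these textbook definitions; it is not the implementation used in any of the cited works.

```python
import math

def grpo_group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    # Normalize each rollout's reward by the group mean and standard deviation,
    # so no separate value model is needed.
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimator: probability that at least one of k samples drawn
    # from n generations is correct, given c correct generations.
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: four rollouts of one prompt with verifiable 0/1 rewards, and
# Pass@1 / Pass@8 estimated from 16 generations of which 5 were correct.
print(grpo_group_advantages([1.0, 0.0, 0.0, 1.0]))   # positive for correct rollouts
print(pass_at_k(16, 5, 1), pass_at_k(16, 5, 8))
```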

Sources

CURE: Critical-Token-Guided Re-concatenation for Entropy-collapse Prevention

ETTRL: Balancing Exploration and Exploitation in LLM Test-Time Reinforcement Learning Via Entropy Mechanism

On-Policy RL Meets Off-Policy Experts: Harmonizing Supervised Fine-Tuning and Reinforcement Learning via Dynamic Weighting

Uncalibrated Reasoning: GRPO Induces Overconfidence for Stochastic Outcomes

EvolMathEval: Towards Evolvable Benchmarks for Mathematical Reasoning via Evolutionary Testing

G$^2$RPO-A: Guided Group Relative Policy Optimization with Adaptive Guidance

Depth-Breadth Synergy in RLVR: Unlocking LLM Reasoning Gains with Adaptive Exploration

Quantifier Instantiations: To Mimic or To Revolt?

Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR

Hard Examples Are All You Need: Maximizing GRPO Post-Training Under Annotation Budgets

Your Reward Function for RL is Your Best PRM for Search: Unifying RL and Search-Based TTS

PuzzleClone: An SMT-Powered Framework for Synthesizing Verifiable Data
