Reinforcement Learning for Large Language Models

Research on reinforcement learning for large language models is moving toward more efficient and stable training. Recent work targets the instabilities that arise when policy-gradient methods are applied to mixture-of-experts architectures, for example via router-aware handling of importance-sampling weights. There is also growing interest in unifying different policy-gradient variants under a common lens, such as interpreting advantage shaping as surrogate reward maximization for pass@k, or aligning training with best-of-N sampling through max@k optimisation. Meanwhile, reinforcement learning with verifiable rewards (RLVR) applied to mathematical and coding domains continues to improve reasoning and problem-solving ability, with particular attention to data efficiency. Noteworthy papers include:

  • Towards Stable and Effective Reinforcement Learning for Mixture-of-Experts, which proposes a router-aware approach to optimizing importance-sampling weights (a generic importance-sampling sketch follows this list).
  • Data-Efficient RLVR via Off-Policy Influence Guidance, which introduces a theoretically grounded approach that uses influence functions to estimate each data point's contribution to the learning objective (a simplified gradient-alignment sketch appears below).
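
The router-aware correction itself is specific to the paper, but the importance-sampling weights it targets appear in any PPO/GRPO-style policy-gradient update. The sketch below is a minimal, generic clipped importance-sampling loss in PyTorch; tensor names, shapes, and the clipping constant are illustrative assumptions, and no router-specific reweighting from the paper is reproduced.

```python
# Minimal sketch of a clipped importance-sampling policy-gradient loss
# (PPO/GRPO-style). Illustrative only; not the paper's router-aware method.
import torch

def clipped_pg_loss(logp_new: torch.Tensor,   # (batch, seq) log-probs under the current policy
                    logp_old: torch.Tensor,   # (batch, seq) log-probs under the behavior policy
                    advantages: torch.Tensor, # (batch, seq) per-token advantages
                    mask: torch.Tensor,       # (batch, seq) 1 for response tokens, 0 for padding
                    clip_eps: float = 0.2) -> torch.Tensor:
    # Per-token importance-sampling ratio pi_new / pi_old.
    ratio = torch.exp(logp_new - logp_old)
    # Clipped surrogate objective: take the pessimistic minimum of the
    # unclipped and clipped terms, then average over valid tokens.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    per_token = -torch.minimum(unclipped, clipped)
    return (per_token * mask).sum() / mask.sum().clamp(min=1.0)
```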

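Influence functions are often approximated to first order by the alignment between a candidate example's gradient and the gradient of a target objective. The sketch below shows that generic gradient-dot-product scoring in PyTorch; it is a simplified assumption, not the paper's off-policy influence-guidance procedure, and the function and argument names are hypothetical.

```python
# Minimal sketch of first-order, gradient-alignment "influence" scoring.
# Assumes each loss tensor still has a live computation graph.
from typing import Iterable, List
import torch

def influence_scores(model: torch.nn.Module,
                     candidate_losses: Iterable[torch.Tensor],
                     target_loss: torch.Tensor) -> List[float]:
    params = [p for p in model.parameters() if p.requires_grad]
    # Gradient of the objective we care about (e.g. loss on a small trusted set).
    target_grad = torch.autograd.grad(target_loss, params,
                                      retain_graph=True, allow_unused=True)
    scores = []
    for loss in candidate_losses:
        cand_grad = torch.autograd.grad(loss, params,
                                        retain_graph=True, allow_unused=True)
        # Higher alignment with the target gradient => larger estimated benefit.
        score = 0.0
        for g_c, g_t in zip(cand_grad, target_grad):
            if g_c is not None and g_t is not None:
                score += (g_c * g_t).sum().item()
        scores.append(score)
    return scores
```
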
Sources

Towards Stable and Effective Reinforcement Learning for Mixture-of-Experts

Advantage Shaping as Surrogate Reward Maximization: Unifying Pass@K Policy Gradients

Rethinking GSPO: The Perplexity-Entropy Equivalence

The Best of N Worlds: Aligning Reinforcement Learning with Best-of-N Sampling via max@k Optimisation

Data-Efficient RLVR via Off-Policy Influence Guidance