The field of reinforcement learning is moving toward more efficient and effective training of large language models. Researchers are exploring new approaches to optimize the training process, including the use of mixture-of-experts architectures and novel router-aware methods. There is also growing interest in unifying different policy gradient optimization methods and providing a common lens for understanding the underlying algorithms. Furthermore, applying reinforcement learning with verifiable rewards (RLVR) to mathematical and coding domains has yielded significant improvements in reasoning and problem-solving ability. Noteworthy papers include:
- Towards Stable and Effective Reinforcement Learning for Mixture-of-Experts, which proposes a novel router-aware approach to optimizing importance sampling weights (see the first sketch after this list).
- Data-Efficient RLVR via Off-Policy Influence Guidance, which introduces a theoretically grounded approach that uses influence functions to estimate each data point's contribution to the learning objective (see the second sketch after this list).
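
The router-aware idea in the first paper can be pictured as gating token-level importance weights by how much the MoE routing distribution drifts between the rollout policy and the current policy. The sketch below is only an illustration of that general idea, assuming access to per-token router probabilities; the function name, the KL-based gate, and the `router_kl_threshold` parameter are assumptions, not the paper's actual formulation.

```python
# Hedged sketch: one plausible "router-aware" correction to importance
# sampling for MoE policy-gradient training. The routing-divergence gate
# below is an assumption for illustration, not the paper's method.
import torch

def router_aware_is_weights(
    logp_new: torch.Tensor,          # (B, T) token log-probs under current policy
    logp_old: torch.Tensor,          # (B, T) token log-probs under rollout policy
    router_probs_new: torch.Tensor,  # (B, T, E) expert routing probs, current policy
    router_probs_old: torch.Tensor,  # (B, T, E) expert routing probs, rollout policy
    clip_eps: float = 0.2,
    router_kl_threshold: float = 0.1,  # hypothetical gating threshold
) -> torch.Tensor:
    """Return clipped importance weights, masking tokens whose routing drifted."""
    # Token-level importance ratio, clipped in PPO style.
    ratio = torch.exp(logp_new - logp_old)
    ratio = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)

    # Per-token KL divergence between rollout and current router distributions.
    router_kl = (router_probs_old *
                 (router_probs_old.clamp_min(1e-8).log() -
                  router_probs_new.clamp_min(1e-8).log())).sum(dim=-1)

    # Zero out tokens whose expert assignment shifted too much, so their
    # unreliable ratios do not destabilize the policy update.
    stable = (router_kl < router_kl_threshold).float()
    return ratio * stable
```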
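
For the influence-guided data selection in the second paper, a common Hessian-free shortcut is to score each candidate example by the inner product between its gradient and the gradient of a held-out objective, then keep the highest-scoring examples. The sketch below shows only that first-order approximation; the paper's actual estimator and its off-policy corrections may differ.

```python
# Hedged sketch: first-order, influence-style data selection for RLVR.
# Gradient dot products stand in for full influence functions here.
import torch

def first_order_influence_scores(
    candidate_grads: torch.Tensor,  # (N, D) per-example gradient estimates
    target_grad: torch.Tensor,      # (D,) gradient of a held-out objective
) -> torch.Tensor:
    """Approximate each candidate's influence as a gradient inner product."""
    return candidate_grads @ target_grad

def select_top_k(candidate_grads: torch.Tensor, target_grad: torch.Tensor, k: int):
    """Pick the k examples with the largest estimated influence on the target objective."""
    scores = first_order_influence_scores(candidate_grads, target_grad)
    return torch.topk(scores, k).indices
```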