The field of reasoning in large language models is increasingly turning to reinforcement learning (RL) to improve model performance. Researchers are exploring how to use RL effectively to expand these models' reasoning capabilities, including prolonging RL training and leveraging diverse suites of tasks. Novel training methodologies, such as those incorporating KL divergence control and reference policy resetting, are also a key area of focus. In addition, open-source datasets and models are advancing the field by giving researchers accessible tools and resources. Notable papers in this area include:
- ProRL, which demonstrates that prolonged RL training can uncover novel reasoning strategies.
- OpenThoughts, which presents a project aimed at creating open-source datasets for training reasoning models, resulting in state-of-the-art performance on several benchmarks.
- Dissecting Long Reasoning Models, which provides insights into the roles of positive and negative samples in RL and identifies data inefficiencies in group relative policy optimization.
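To make the techniques named above more concrete, here is a minimal sketch of group-relative advantage estimation (the core of group relative policy optimization) together with a KL-penalized per-token objective of the kind used for KL divergence control. Function names, the `beta` penalty weight, and the specific KL estimator are illustrative assumptions, not details taken from the cited papers; reference policy resetting would amount to periodically refreshing the model that produces `ref_logp`.

```python
import math

def grpo_advantages(rewards):
    """Group-relative advantages: standardize rewards across a group of
    responses sampled for the same prompt, so each response is scored
    relative to its group rather than by an absolute baseline."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    if std == 0.0:
        # Degenerate group: all rewards equal, so there is no learning
        # signal (one source of the data inefficiency noted above).
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

def kl_penalized_objective(logp, ref_logp, advantage, beta=0.05):
    """Per-token surrogate objective: a policy-gradient term minus a
    KL-style penalty toward a frozen reference policy (KL control).
    Uses an unbiased, non-negative estimator of KL(policy || reference).
    `beta` is an illustrative hyperparameter, not a published value."""
    ratio = math.exp(ref_logp - logp)
    kl_estimate = ratio - (ref_logp - logp) - 1.0
    return advantage * logp - beta * kl_estimate
```

In this sketch, a group whose rewards all tie contributes zero advantage everywhere, and the KL penalty vanishes when the policy matches the reference, so the objective reduces to the plain policy-gradient term.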