The field of natural language processing is seeing a marked shift toward reinforcement learning (RL) as a way to advance long-form writing and reasoning capabilities. Researchers are exploring approaches that overcome the limitations of traditional supervised fine-tuning, such as data saturation and restricted learning capacity. One notable direction is adaptive curriculum RL, in which models learn from selectively chosen training examples and the curriculum adapts as task difficulty evolves; this improves long-form writing performance and, perhaps surprisingly, generalizes to long-input reasoning tasks (a minimal sketch of such a difficulty-aware selection loop follows the paper list below). Another area of focus is Reinforcement-Learned Teachers (RLTs) that distill knowledge to student models, improving the efficiency and reusability of the RL reasoning pipeline. Furthermore, online curriculum learning and in-context exploration are being investigated to accelerate training and to extrapolate test-time compute for large language models. Noteworthy papers in this area include:
- Writing-RL, which presents an Adaptive Curriculum Reinforcement Learning framework to advance long-form writing capabilities.
- Reinforcement Learning Teachers of Test Time Scaling, which introduces a new framework for training RLTs to yield effective downstream distillation.
- SPEED-RL, which proposes selective prompting as an adaptive online RL curriculum.
- e3, which enables extrapolation of test-time compute by training LLMs to perform in-context exploration.
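
To make the curriculum idea concrete, the sketch below shows one way a difficulty-aware prompt selector might look: it tracks per-prompt success rates and repeatedly samples prompts whose difficulty sits near a target band, so the model trains on examples that are neither trivial nor hopeless for its current policy. This is a minimal illustration under assumed details, not the actual mechanism of Writing-RL or SPEED-RL; all names here (`AdaptiveCurriculum`, `fake_rollout`, the 0.5 target rate) are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class PromptStats:
    """Running success statistics for one training prompt."""
    attempts: int = 0
    successes: int = 0

    @property
    def success_rate(self) -> float:
        # Optimistic prior: unseen prompts are treated as moderately difficult.
        return self.successes / self.attempts if self.attempts else 0.5


class AdaptiveCurriculum:
    """Toy difficulty-based prompt selector (illustrative assumption, not a paper's method).

    Prefers prompts whose observed success rate is closest to a target
    (e.g. 0.5), and re-scores them continuously as the policy improves.
    """

    def __init__(self, prompts, target_rate=0.5, batch_size=4):
        self.stats = {p: PromptStats() for p in prompts}
        self.target_rate = target_rate
        self.batch_size = batch_size

    def sample_batch(self):
        # Score = distance from the target success rate; smaller is better.
        scored = sorted(
            self.stats,
            key=lambda p: abs(self.stats[p].success_rate - self.target_rate),
        )
        return scored[: self.batch_size]

    def update(self, prompt, reward):
        s = self.stats[prompt]
        s.attempts += 1
        s.successes += int(reward > 0)


def fake_rollout(prompt) -> float:
    """Stand-in for an RL rollout plus reward model; returns a binary reward."""
    return float(random.random() < 0.4)


if __name__ == "__main__":
    curriculum = AdaptiveCurriculum([f"prompt-{i}" for i in range(20)])
    for step in range(10):
        for prompt in curriculum.sample_batch():
            reward = fake_rollout(prompt)
            curriculum.update(prompt, reward)
            # A real trainer would apply a policy-gradient update here.
```

In a full system, the binary reward would come from a verifier or reward model, and the selection rule would typically be a sampling distribution rather than a hard top-k cut, but the core loop of estimating difficulty online and feeding the policy examples near its frontier is the same.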