Efficient Reinforcement Learning for Large Language Models

Reinforcement learning for large language models is moving toward more efficient and scalable methods. Researchers are exploring architectures and algorithms that reduce the computational cost of training and inference while maintaining or improving model performance. One key direction exploits the similarity of historical rollout token sequences to accelerate RL training; another designs frameworks that combine complementary approaches, such as rule-based reinforcement learning and optimized self-training, to reach state-of-the-art performance.

Noteworthy papers in this area include TreePO, which introduces a self-guided, tree-based rollout algorithm that reduces the per-update compute burden while preserving exploration diversity; RhymeRL, which accelerates RL training by leveraging the similarity of historical rollout token sequences and balancing load across rollout workers; and ReST-RL, which combines an improved GRPO algorithm with a carefully designed test-time decoding method to achieve high code-reasoning accuracy.
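As a reference point for the GRPO variants mentioned above, the sketch below shows the standard group-relative advantage computation that such methods build on: each prompt's sampled rollouts form a group, and every rollout's reward is normalized by its group's statistics, removing the need for a learned critic. This is a minimal illustrative sketch only; the specific improvements introduced by ReST-RL and TreePO are not reproduced here, and the function name and tensor shapes are assumptions.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages in the style of standard GRPO.

    rewards: tensor of shape (num_prompts, group_size), one scalar reward per
    rollout in each prompt's group. Each rollout's advantage is its reward
    normalized by the mean and standard deviation of its own group.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled rollouts per prompt.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.2, 0.9, 0.4, 0.5]])
print(group_relative_advantages(rewards))
```

Because the advantage is computed per group rather than per token via a value network, the cost of a policy update scales mainly with rollout generation, which is exactly the stage that tree-based sampling and historical-rollout reuse aim to cheapen.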

Sources

TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling

Group Expectation Policy Optimization for Stable Heterogeneous Reinforcement Learning in LLMs

History Rhymes: Accelerating LLM Reinforcement Learning with RhymeRL

Harnessing Rule-Based Reinforcement Learning for Enhanced Grammatical Error Correction

HAEPO: History-Aggregated Exploratory Policy Optimization

ReST-RL: Achieving Accurate Code Reasoning of LLMs with Optimized Self-Training and Decoding

Towards Better Correctness and Efficiency in Code Generation
