The field of reinforcement learning for large language models is moving towards more efficient and effective methods for credit assignment and policy optimization. Recent developments have focused on improving the ability of models to learn from sparse and delayed rewards, with particular emphasis on reasoning tasks such as math and medical question answering. Notable advances include tree-structured credit assignment, mixed-advantage policy optimization, and negative-enhanced group relative policy optimization. These innovations have led to significant performance improvements across reasoning benchmarks. Some noteworthy papers in this area include:
- TEMPO, which introduces a critic-free algorithm that augments the group-relative outcome signal with branch-gated temporal-difference corrections (a simplified sketch of this group-relative signal appears after this list), outperforming existing methods on several benchmarks.
- NGRPO, which proposes an algorithm that converts homogeneous errors (groups in which every sampled response is wrong, so the group-relative advantage collapses to zero) into robust learning signals, achieving state-of-the-art results on mathematical reasoning benchmarks.
- Soft Tokens, Hard Truths, which introduces a scalable method for learning continuous Chain-of-Thought via reinforcement learning, yielding more diverse reasoning chains while preserving base-model predictions on out-of-domain tasks (a toy sketch of the soft-token operation also follows this list).
- SIM-CoT, which proposes a plug-and-play training module that introduces step-level supervision to stabilize and enrich the latent reasoning space, significantly improving in-domain accuracy and out-of-domain stability of implicit Chain-of-Thought methods.
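Several of the methods above (GRPO-style training, TEMPO, NGRPO) build on a group-relative outcome advantage: the rewards of rollouts sampled for the same prompt are normalized against the group's mean and standard deviation. The minimal sketch below illustrates that signal, the degenerate all-wrong group that negative-enhanced variants such as NGRPO target, and a branch-gated correction term that is only a simplified stand-in for TEMPO's actual temporal-difference rule; all function names and the gating scheme are assumptions for illustration, not the papers' exact formulations.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Group-relative outcome signal: normalize each rollout's reward
    against the mean/std of its sibling rollouts for the same prompt."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

def branch_gated_advantage(rewards, td_corrections, branch_gate):
    """Hypothetical TEMPO-flavoured variant: add a temporal-difference
    correction only at rollouts flagged as branch points. The gate and
    correction values here are illustrative, not the paper's rule."""
    base = group_relative_advantages(rewards)
    gate = np.asarray(branch_gate, dtype=np.float64)
    return base + gate * np.asarray(td_corrections, dtype=np.float64)

if __name__ == "__main__":
    # Mixed group: the signal separates correct from incorrect rollouts.
    print(group_relative_advantages([1.0, 1.0, 0.0, 0.0]))  # ~[+1, +1, -1, -1]
    # Homogeneously wrong group: the outcome signal collapses to zero,
    # the degenerate case that negative-enhanced variants (e.g. NGRPO) address.
    print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))  # [0, 0, 0, 0]
    # Branch-gated correction applied only where rollouts diverge.
    print(branch_gated_advantage([1.0, 1.0, 0.0, 0.0],
                                 td_corrections=[0.1, -0.1, 0.0, 0.0],
                                 branch_gate=[1, 1, 0, 0]))
```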
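The continuous Chain-of-Thought explored in Soft Tokens, Hard Truths replaces discrete sampled reasoning tokens with probability-weighted mixtures of token embeddings. The toy sketch below shows that core "soft token" operation under assumed names and shapes; it does not reproduce the paper's full training recipe (e.g. its RL exploration noise).

```python
import numpy as np

def soft_token_embedding(logits, embedding_matrix, temperature=1.0):
    """Continuous ('soft') token: instead of sampling one hard token id,
    feed back the probability-weighted mixture of token embeddings."""
    z = np.asarray(logits, dtype=np.float64) / temperature
    z -= z.max()                          # numerical stability
    probs = np.exp(z) / np.exp(z).sum()   # softmax over the vocabulary
    return probs @ embedding_matrix       # (vocab,) @ (vocab, dim) -> (dim,)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vocab, dim = 5, 3
    E = rng.normal(size=(vocab, dim))     # toy embedding table
    logits = np.array([2.0, 0.5, 0.0, -1.0, -1.0])
    # The result lies in the convex hull of the token embeddings,
    # which is what lets the reasoning step stay continuous.
    print(soft_token_embedding(logits, E))
```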