Advances in Large Language Model Reasoning

The field of large language model (LLM) reasoning is shifting toward more efficient and effective reinforcement learning (RL) methods. In place of traditional policy-based approaches, researchers are exploring value-based alternatives such as Q-learning to improve sample efficiency and enable offline learning for LLMs. A second line of work develops adaptive curriculum learning methods that learn a curriculum policy concurrently with RL fine-tuning, yielding better generalization to harder, out-of-distribution test problems. Noteworthy papers include ShiQ, which derives theoretically grounded loss functions from Bellman equations to adapt Q-learning to LLMs, and Trajectory Bellman Residual Minimization, which introduces a simple yet effective off-policy algorithm that optimizes a single trajectory-level Bellman objective using the model's own logits as Q-values. Self-Evolving Curriculum and Dynamic Sampling that Adapts likewise demonstrate the importance of adaptive data selection and curriculum learning for advancing LLM reasoning capabilities.
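
To make the value-based direction concrete, the sketch below computes a toy trajectory-level Bellman residual loss that reads the policy model's per-token logits as Q-values, in the spirit of Trajectory Bellman Residual Minimization. This is a minimal sketch, not the paper's exact objective: the soft value estimate, the terminal-reward placement, and the tensor shapes are all illustrative assumptions.

```python
import torch

def trajectory_bellman_residual_loss(logits, actions, reward, gamma=1.0):
    """Toy trajectory-level Bellman residual for one sampled response.

    logits:  (T, V) per-step logits from the policy model, read as Q-values.
    actions: (T,)   token ids actually sampled at each step.
    reward:  scalar terminal reward for the trajectory (e.g. answer correctness).
    """
    T = logits.shape[0]
    # Q(s_t, a_t): logit of the token that was actually sampled at step t.
    q_taken = logits.gather(1, actions.unsqueeze(1)).squeeze(1)    # (T,)
    # Soft state value V(s_t) = logsumexp over actions of Q(s_t, .).
    values = torch.logsumexp(logits, dim=1)                        # (T,)
    # Bootstrapped targets r_t + gamma * V(s_{t+1}); the reward arrives only
    # at the final step, and the value of the terminal state is zero.
    rewards = torch.zeros(T)
    rewards[-1] = reward
    next_values = torch.cat([values[1:], values.new_zeros(1)])
    targets = rewards + gamma * next_values
    # Trajectory-level objective: sum the per-step residuals over the whole
    # trajectory, then square once, rather than squaring each step separately.
    return (q_taken - targets).sum() ** 2

# Tiny usage example with random logits standing in for a language model.
torch.manual_seed(0)
T, vocab = 8, 16
logits = torch.randn(T, vocab, requires_grad=True)
actions = torch.randint(vocab, (T,))
loss = trajectory_bellman_residual_loss(logits, actions, reward=1.0)
loss.backward()
print(f"toy trajectory Bellman residual: {loss.item():.4f}")
```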

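The curriculum direction can be illustrated with an equally small sketch: a non-parametric "curriculum policy" that tracks recent success per difficulty level and preferentially samples levels where the model succeeds about half the time. This is an assumed bandit-style stand-in for the general idea of adapting data selection to the model's current ability, not the algorithm from Self-Evolving Curriculum or Dynamic Sampling that Adapts.

```python
import math
import random
from collections import defaultdict, deque

class AdaptiveCurriculumSampler:
    """Toy curriculum policy: favour difficulty levels where the model's recent
    success rate is closest to a target, keeping training at the edge of ability."""

    def __init__(self, levels, target_success=0.5, window=50, temperature=0.1):
        self.levels = list(levels)
        self.target = target_success
        self.temperature = temperature
        self.history = defaultdict(lambda: deque(maxlen=window))

    def _score(self, level):
        hist = self.history[level]
        if not hist:                                # unexplored levels come first
            return 1.0
        success = sum(hist) / len(hist)
        return 1.0 - abs(success - self.target)     # peaks when success ~= target

    def sample_level(self):
        # Softmax over scores acts as the (non-parametric) curriculum policy.
        weights = [math.exp(self._score(l) / self.temperature) for l in self.levels]
        return random.choices(self.levels, weights=weights, k=1)[0]

    def update(self, level, solved):
        # Called after each rollout with the RL reward signal (1 = solved).
        self.history[level].append(1.0 if solved else 0.0)

# Usage inside an RL fine-tuning loop; rollouts and rewards are simulated here.
sampler = AdaptiveCurriculumSampler(levels=["easy", "medium", "hard"])
true_rates = {"easy": 0.9, "medium": 0.5, "hard": 0.1}
for _ in range(200):
    level = sampler.sample_level()
    sampler.update(level, solved=random.random() < true_rates[level])
print({l: len(sampler.history[l]) for l in sampler.levels})
```
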
Sources

ShiQ: Bringing back Bellman to LLMs

Unearthing Gems from Stones: Policy Optimization with Negative Sample Augmentation for LLM Reasoning

Self-Evolving Curriculum for LLM Reasoning

Trajectory Bellman Residual Minimization: A Simple Value-Based Method for LLM Reasoning

Dynamic Sampling that Adapts: Iterative DPO for Self-Aware Mathematical Reasoning

AdaSTaR: Adaptive Data Sampling for Training Self-Taught Reasoners