Reinforcement Learning for Large Language Models

The field of large language models (LLMs) is increasingly turning to reinforcement learning (RL) to strengthen reasoning capabilities, particularly where supervised fine-tuning alone falls short. One key direction is the design of RL algorithms that estimate advantages more reliably and stabilize policy optimization; another is the integration of external tools and multimodal inputs to support more effective, collaborative reasoning. Notable papers include AAPO, which proposes a momentum-based advantage estimation scheme to mitigate training inefficiencies, and Tool-Star, which introduces a framework that empowers LLMs to invoke multiple external tools during stepwise reasoning. R1-ShareVL incentivizes reasoning in multimodal LLMs via Share-GRPO, while KTAE estimates token-level (key-token) advantages for mathematical reasoning. SophiaVL-R1 proposes a thinking reward model that supervises the reasoning process itself during RL training.
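To make the advantage-estimation thread concrete, here is a minimal sketch of group-relative advantage computation of the kind GRPO-style methods (which Share-GRPO extends) build on, plus a hypothetical momentum-smoothed baseline loosely inspired by AAPO's "advantage momentum" idea. The function names, the `beta` parameter, and the momentum formulation are illustrative assumptions, not the papers' actual algorithms.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each sampled response's reward
    against the group of responses drawn for the same prompt."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def momentum_advantages(rewards, prev_baseline, beta=0.9, eps=1e-8):
    """Hypothetical variant: smooth the per-group baseline with momentum
    across training steps (an assumption for illustration; AAPO's exact
    scheme is not reproduced here)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    baseline = beta * prev_baseline + (1.0 - beta) * rewards.mean()
    return (rewards - baseline) / (rewards.std() + eps), baseline

# Example: correctness scores for four sampled answers to one prompt.
rewards = [1.0, 0.0, 0.5, 1.0]
adv = group_relative_advantages(rewards)
adv_m, new_baseline = momentum_advantages(rewards, prev_baseline=0.4)
```

A group-relative baseline removes the need for a learned value model; a momentum-smoothed baseline would additionally reduce step-to-step variance when group sizes are small, which is one plausible reading of the training inefficiencies AAPO targets.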

Sources

AAPO: Enhance the Reasoning Capabilities of LLMs with Advantage Momentum

Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning

R1-ShareVL: Incentivizing Reasoning Capability of Multimodal Large Language Models via Share-GRPO

KTAE: A Model-Free Algorithm to Key-Tokens Advantage Estimation in Mathematical Reasoning

SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward
