Advances in Reinforcement Learning for Large Language Models

The field of large language models (LLMs) is advancing rapidly, with a strong focus on improving reasoning capabilities through reinforcement learning (RL). Recent work has introduced novel RL algorithms such as Difficulty-Aware Certainty-guided Exploration (DACE), which adapts the exploration-exploitation trade-off to problem difficulty, and Balanced Actor Initialization (BAI), which stabilizes RLHF training of distillation-based reasoning models. In parallel, Ranked Preference Reinforcement Optimization (RPRO) targets medical question answering and diagnostic reasoning, while Reasoning Vectors transfer chain-of-thought capabilities between models via task arithmetic. Noteworthy papers include 'Know When to Explore: Difficulty-Aware Certainty as a Guide for LLM Reinforcement Learning', which introduces DACE, and 'Reasoning Vectors: Transferring Chain-of-Thought Capabilities via Task Arithmetic'. Together, these approaches point toward more capable and efficient reasoning models.
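The task-arithmetic idea behind Reasoning Vectors can be illustrated with a minimal sketch: subtract a base checkpoint's weights from a reasoning-tuned checkpoint to obtain a "reasoning vector", then add that vector to a compatible target model. The function names and the scaling factor `alpha` below are illustrative assumptions rather than the paper's actual interface, and the sketch assumes all checkpoints share the same architecture and parameter names.

```python
import torch
import torch.nn as nn

def extract_reasoning_vector(reasoning_state, base_state):
    """Task arithmetic: reasoning vector = reasoning-tuned weights
    minus base weights, computed per parameter tensor."""
    return {k: reasoning_state[k] - base_state[k] for k in base_state}

def apply_reasoning_vector(target_state, vector, alpha=1.0):
    """Add the (optionally scaled) reasoning vector to the weights of a
    target model that shares the same architecture and parameter names."""
    return {k: target_state[k] + alpha * vector[k] for k in target_state}

# Toy demonstration with small linear layers standing in for LLM checkpoints.
base, tuned, target = (nn.Linear(4, 4) for _ in range(3))
vec = extract_reasoning_vector(tuned.state_dict(), base.state_dict())
target.load_state_dict(apply_reasoning_vector(target.state_dict(), vec))
```

Because the vector is a plain per-parameter difference, `alpha` can scale how strongly the transferred behavior is applied; in practice one would compute it between full LLM checkpoints rather than toy layers.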

Sources

Know When to Explore: Difficulty-Aware Certainty as a Guide for LLM Reinforcement Learning

Balanced Actor Initialization: Stable RLHF Training of Distillation-Based Reasoning Models

RPRO: Ranked Preference Reinforcement Optimization for Enhancing Medical QA and Diagnostic Reasoning

Reasoning Vectors: Transferring Chain-of-Thought Capabilities via Task Arithmetic

SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning

GRAM-R$^2$: Self-Training Generative Foundation Reward Models for Reward Reasoning

Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR

Autonomous Learning From Success and Failure: Goal-Conditioned Supervised Learning with Negative Feedback

Empowering Lightweight MLLMs with Reasoning via Long CoT SFT

Beyond Correctness: Harmonizing Process and Outcome Rewards through RL Training

On Entropy Control in LLM-RL Algorithms

AR$^2$: Adversarial Reinforcement Learning for Abstract Reasoning in Large Language Models

Emergent Hierarchical Reasoning in LLMs through Reinforcement Learning

CoT-Space: A Theoretical Framework for Internal Slow-Thinking via Reinforcement Learning

Self-Aligned Reward: Towards Effective and Efficient Reasoners

DreamPRM-1.5: Unlocking the Potential of Each Instance for Multimodal Process Reward Model Training

Rethinking Reasoning Quality in Large Language Models through Enhanced Chain-of-Thought via RL

Reverse-Engineered Reasoning for Open-Ended Generation

Reasoning Language Model for Personalized Lung Cancer Screening

Staying in the Sweet Spot: Responsive Reasoning Evolution via Capability-Adaptive Hint Scaffolding

Beyond Two-Stage Training: Cooperative SFT and RL for LLM Reasoning

Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models

Culturally transmitted color categories in LLMs reflect a learning bias toward efficient compression

A Survey of Reinforcement Learning for Large Reasoning Models
