Advances in Large Language Model Reasoning

The field of large language models (LLMs) is advancing rapidly, with much of the current effort aimed at improving reasoning capabilities through reinforcement learning (RL). Recent work highlights the importance of exploration, diversity, and robustness during training, proposing techniques such as sequential sampling, rubric-based incremental training, and Bayesian optimal stopping to address these challenges. In parallel, innovations in policy optimization, including balanced policy optimization with adaptive clipping and scaffolded group relative policy optimization (GRPO), have improved training stability and efficiency. These advances have direct implications for code generation, conversational recommender systems, and other downstream applications.
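Several of the papers below are GRPO variants, which replace a learned value function with advantages computed relative to a group of responses sampled for the same prompt. The Python sketch below shows the core group-relative advantage step and the PPO-style clipped loss it feeds into; the function names, the fixed clipping threshold, and the toy rewards are illustrative assumptions, not details taken from any cited paper.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: each reward is normalized against the
    group of responses sampled for the same prompt."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def clipped_policy_loss(advantages, ratios, clip=0.2):
    """PPO-style clipped surrogate shared by GRPO variants; an
    adaptive-clipping method would tune `clip` per update (not shown)."""
    unclipped = ratios * advantages
    clipped = np.clip(ratios, 1.0 - clip, 1.0 + clip) * advantages
    return -np.minimum(unclipped, clipped).mean()

# Toy example: four responses sampled for one prompt, scored by a verifier.
rewards = [0.0, 1.0, 1.0, 0.5]
advantages = group_relative_advantages(rewards)
ratios = np.array([1.10, 0.90, 1.00, 1.05])  # pi_new(y|x) / pi_old(y|x)
print(advantages, clipped_policy_loss(advantages, ratios))
```

Methods such as BAPO and Scaf-GRPO modify this basic recipe rather than replacing it, the former by adapting the clip range during off-policy updates and the latter by scaffolding how the sampled group is constructed.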

Noteworthy papers include:

InfiMed-ORBIT introduces a rubric-based incremental training framework for high-stakes medical dialogue, achieving state-of-the-art results on the HealthBench-Hard benchmark.

BEACON proposes a Bayesian optimal stopping framework for efficient LLM sampling, reducing average sampling by up to 80% while maintaining response quality (a schematic sketch of this style of stopping rule follows this list).

Scaf-GRPO presents a scaffolded group relative policy optimization framework for enhancing LLM reasoning, boosting the pass@1 score on the AIME24 benchmark by a relative 44.3% over a vanilla GRPO baseline.
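BEACON's exact stopping rule is not reproduced here, but the general pattern of Bayesian optimal stopping for repeated LLM sampling can be sketched: maintain a posterior over whether another sample is likely to improve on the best response so far, and stop once the expected gain falls below the cost of sampling. The Beta-Bernoulli posterior, cost threshold, and function names below are illustrative assumptions, not BEACON's actual formulation.

```python
import random

def sample_until_confident(generate, score, max_samples=16,
                           cost_per_sample=0.02, alpha=1.0, beta=1.0):
    """Bayesian optimal stopping for repeated LLM sampling (illustrative).

    A Beta(alpha, beta) posterior tracks p = P(the next sample improves
    on the current best). Sampling stops when the posterior mean of p
    drops below the per-sample cost. This is a toy rule, not BEACON's.
    """
    best, best_score = None, float("-inf")
    for _ in range(max_samples):
        p_improve = alpha / (alpha + beta)  # posterior mean under Beta-Bernoulli
        if best is not None and p_improve < cost_per_sample:
            break  # expected gain no longer justifies another sample
        candidate = generate()
        s = score(candidate)
        if s > best_score:
            best, best_score = candidate, s
            alpha += 1.0  # observed an improvement
        else:
            beta += 1.0   # observed a non-improvement
    return best, best_score

# Toy usage: a random "generator" with an identity scorer.
best, best_score = sample_until_confident(lambda: random.random(), lambda x: x)
print(best, best_score)
```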

Sources

The Road Less Traveled: Enhancing Exploration in LLMs via Sequential Sampling

InfiMed-ORBIT: Aligning LLMs on Open-Ended Complex Tasks via Rubric-Based Incremental Training

BEACON: Bayesian Optimal Stopping for Efficient LLM Sampling

LANPO: Bootstrapping Language and Numerical Feedback for Reinforcement Learning in LLMs

Count Counts: Motivating Exploration in LLM Reasoning with Count-based Intrinsic Rewards

CosmoCore: Affective Dream-Replay Reinforcement Learning for Code Generation

Noise-corrected GRPO: From Noisy Rewards to Unbiased Gradients

BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping

Scaf-GRPO: Scaffolded Group Relative Policy Optimization for Enhancing LLM Reasoning

Learning from Supervision with Semantic and Episodic Memory: A Reflective Approach to Agent Adaptation

SALT: Step-level Advantage Assignment for Long-horizon Agents via Trajectory Graph

Rank-GRPO: Training LLM-based Conversational Recommender Systems with Reinforcement Learning

KL-Regularized Reinforcement Learning is Designed to Mode Collapse
