Advances in Reward Modeling and Reinforcement Learning for Large Language Models

Research on large language models continues to advance rapidly, with much of the recent effort devoted to better reward modeling and reinforcement learning techniques. A central challenge is reward hacking, where a model optimizes superficial or spurious attributes of the reward signal rather than the true causal drivers of response quality. Proposed remedies include reward models grounded in causal rubrics, hedging against over-optimization of proxy rewards at inference time, and asymmetric REINFORCE updates for off-policy training. Together, these advances stand to improve both the performance and the alignment of large language models. Notable papers in this area include:

  • Robust Reward Modeling via Causal Rubrics, which introduces a novel framework for mitigating reward hacking.
  • Inference-Time Reward Hacking in Large Language Models, which characterizes reward hacking in inference-time alignment and introduces an efficient algorithm for finding the optimal inference-time parameter (see the toy simulation after this list).
  • Asymmetric REINFORCE for off-Policy Reinforcement Learning, which gives a theoretical analysis of off-policy REINFORCE algorithms that balance positive and negative rewards and validates the analysis experimentally (see the loss sketch after this list).
  • Mastering Multiple-Expert Routing, which introduces novel surrogate loss functions and efficient learning-to-defer algorithms with strong theoretical guarantees.
  • Bridging Offline and Online Reinforcement Learning for LLMs, which investigates the effectiveness of reinforcement learning methods for fine-tuning large language models in offline, semi-online, and fully online regimes.
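
To make the inference-time failure mode concrete, the toy simulation below scores a pool of n candidates with a noisy proxy reward and keeps the proxy-best one. It is a minimal sketch under assumed reward distributions (Gaussian true reward, heavy-tailed proxy error), not the estimator or tuning procedure from the paper; it only illustrates why a larger candidate pool does not keep improving the true reward, which is the over-optimization behavior the paper characterizes.

```python
# Toy Best-of-n simulation of inference-time reward hacking.
# Assumptions (not from the paper): the true reward is standard normal and
# the proxy adds heavy-tailed (Cauchy) error, so proxy-best selection is
# increasingly driven by proxy noise as the candidate pool grows.
import numpy as np

rng = np.random.default_rng(0)

def mean_true_reward_of_proxy_best(n: int, trials: int = 50_000) -> float:
    """Average true reward of the candidate that maximizes the proxy score."""
    true = rng.normal(size=(trials, n))                          # latent quality
    proxy = true + 0.5 * rng.standard_cauchy(size=(trials, n))   # noisy proxy score
    best = proxy.argmax(axis=1)                                  # Best-of-n pick
    return float(true[np.arange(trials), best].mean())

for n in (1, 2, 4, 8, 16, 32, 64, 128):
    print(f"n={n:4d}  mean true reward of proxy-best candidate: "
          f"{mean_true_reward_of_proxy_best(n):+.3f}")
```

With these illustrative settings, the printed curve typically rises for small n and then degrades, which is exactly the regime in which tuning the inference-time parameter (here, n) matters.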

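For the off-policy REINFORCE line of work, the snippet below shows one simple way to make a policy-gradient update asymmetric: completions with positive advantage receive full weight, negative-advantage completions are down-weighted, and clipped importance weights account for the data coming from a stale behavior policy. The coefficient, the clipping threshold, and the function name are assumptions made for this sketch; it is not the exact estimator analyzed in the paper.

```python
# Illustrative asymmetric REINFORCE-style loss for off-policy data.
# The down-weighting of negative advantages (neg_coef) and the importance-
# weight clipping are assumptions for this sketch, not the paper's estimator.
import torch

def asymmetric_reinforce_loss(
    logp_current: torch.Tensor,   # log pi_theta(y|x) for sampled completions
    logp_behavior: torch.Tensor,  # log mu(y|x) under the stale behavior policy
    advantages: torch.Tensor,     # reward minus baseline, one per completion
    neg_coef: float = 0.5,        # how strongly negative-advantage samples push back
    clip_ratio: float = 5.0,      # cap on importance weights for stability
) -> torch.Tensor:
    # Clipped importance weights correct (approximately) for off-policy sampling.
    ratio = torch.exp(logp_current - logp_behavior).clamp(max=clip_ratio).detach()
    # Asymmetry: full weight on positive advantages, reduced weight on negative ones.
    coef = torch.where(advantages >= 0,
                       torch.ones_like(advantages),
                       torch.full_like(advantages, neg_coef))
    # REINFORCE objective: maximize E[w * A * log pi], so minimize its negation.
    return -(ratio * coef * advantages * logp_current).mean()

# Usage with dummy values for a batch of four completions.
logp_cur = torch.log(torch.tensor([0.20, 0.10, 0.05, 0.40]))
logp_beh = torch.log(torch.tensor([0.25, 0.05, 0.10, 0.30]))
adv = torch.tensor([1.0, -0.5, 2.0, -1.5])
print(asymmetric_reinforce_loss(logp_cur, logp_beh, adv))
```
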
Sources

Robust Reward Modeling via Causal Rubrics

Inference-Time Reward Hacking in Large Language Models

Asymmetric REINFORCE for off-Policy Reinforcement Learning: Balancing positive and negative rewards

Mastering Multiple-Expert Routing: Realizable $H$-Consistency and Strong Guarantees for Learning to Defer

Bridging Offline and Online Reinforcement Learning for LLMs
