Advances in Efficient Reinforcement Learning and Preference Optimization

The field of reinforcement learning and preference optimization is advancing rapidly, with a focus on improving efficiency and scalability. Recent developments center on enhancing the training of large language models (LLMs) and addressing the limitations of existing methods. Notably, researchers have proposed new approaches to mitigate reward over-optimization, improve data utilization, and build more effective optimization frameworks.

One key direction is the development of efficient algorithms for Group Relative Policy Optimization (GRPO), which has shown promise for policy learning in LLMs. Another significant line of work applies importance sampling and related techniques to mitigate reward over-optimization in Direct Alignment Algorithms (DAAs).
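
To make the group-relative idea concrete, the following is a minimal sketch of the advantage computation at the core of GRPO, assuming scalar rewards for a group of responses sampled per prompt; the function name and example values are illustrative, and the full algorithm additionally optimizes a clipped policy-gradient objective with KL regularization toward a reference policy.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize each response's reward against its own group (the responses
    sampled for the same prompt), so no learned critic is needed."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled responses each (rewards are made up).
rewards = torch.tensor([[0.1, 0.9, 0.4, 0.6],
                        [1.0, 1.0, 0.2, 0.8]])
print(group_relative_advantages(rewards))  # above-average responses get positive advantages
```

Because every response in a group conditions on the same prompt, a naive implementation re-encodes that shared prefix once per response; the Shared-Prefix Forward strategy of Prefix Grouper (listed below) targets exactly this redundancy.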

Additionally, there is growing interest in preference learning methods that align aggregate opinions and policies proportionally with the true population distribution of evaluator preferences. New frameworks and algorithms address this challenge by incorporating axioms from social choice theory and leveraging data augmentation and expansion techniques.
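
As a toy illustration of that proportionality goal (not the axiomatic construction from the cited work), one can reweight labeled comparisons so that each evaluator group's total influence matches its population share rather than its annotation volume; the group labels and shares below are hypothetical.

```python
from collections import Counter

def proportional_weights(annotator_groups, population_shares):
    """Weight each labeled comparison so that every evaluator group's total
    weight equals its population share, regardless of how many labels the
    group contributed."""
    counts = Counter(annotator_groups)
    return [population_shares[g] / counts[g] for g in annotator_groups]

# Hypothetical data: group 'A' is half the population but supplies 3 of 4 labels,
# so each of its comparisons is down-weighted relative to group 'B'.
groups = ["A", "A", "A", "B"]
shares = {"A": 0.5, "B": 0.5}
print(proportional_weights(groups, shares))  # [0.1667, 0.1667, 0.1667, 0.5]
```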

Some noteworthy papers in this area include:

  • Prefix Grouper, which proposes an efficient GRPO training algorithm that eliminates redundant prefix computation via a Shared-Prefix Forward strategy.
  • ConfPO, a method for preference learning in LLMs that identifies and optimizes preference-critical tokens based solely on the training policy's confidence (a sketch of this token-selection idea follows the list).
  • Omni-DPO, a dual-perspective optimization framework that jointly accounts for the inherent quality of each preference pair and the model's evolving performance on those pairs.
  • RePO, which leverages diverse replay strategies to retrieve off-policy samples from a replay buffer, allowing policy optimization based on a broader and more diverse set of samples for each prompt.
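
To illustrate the token-selection idea referenced in the ConfPO entry above, the sketch below masks response tokens by the training policy's own confidence. The threshold and the "low-confidence tokens are critical" rule are assumptions of this sketch rather than the paper's exact criterion; the point is that only the selected tokens would contribute to a DPO-style per-token loss.

```python
import torch
import torch.nn.functional as F

def confidence_token_mask(logits: torch.Tensor, labels: torch.Tensor,
                          threshold: float = 0.5) -> torch.Tensor:
    """Boolean mask over response tokens, selecting those on which the training
    policy is unconfident (assumed here to be the preference-critical ones).

    logits: (batch, seq_len, vocab) policy logits for a response.
    labels: (batch, seq_len) token ids of that response.
    """
    logprobs = F.log_softmax(logits, dim=-1)
    token_logprobs = logprobs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    return token_logprobs.exp() < threshold

# Hypothetical shapes and vocab size, just to show usage; the resulting mask
# would gate the per-token terms of the preference loss.
logits = torch.randn(1, 6, 32000)
labels = torch.randint(0, 32000, (1, 6))
print(confidence_token_mask(logits, labels))
```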

Sources

  • Prefix Grouper: Efficient GRPO Training through Shared-Prefix Forward
  • Population-Proportional Preference Learning from Human Feedback: An Axiomatic Approach
  • Mitigating Reward Over-optimization in Direct Alignment Algorithms with Importance Sampling
  • ConfPO: Exploiting Policy Model Confidence for Critical Token Selection in Large Language Model Preference Optimization
  • GFRIEND: Generative Few-shot Reward Inference through EfficieNt DPO
  • RePO: Replay-Enhanced Policy Optimization
  • Towards Bridging the Reward-Generation Gap in Direct Alignment Algorithms
  • Efficient Preference-Based Reinforcement Learning: Randomized Exploration Meets Experimental Design
  • Delegations as Adaptive Representation Patterns: Rethinking Influence in Liquid Democracy
  • Metritocracy: Representative Metrics for Lite Benchmarks
  • Omni-DPO: A Dual-Perspective Paradigm for Dynamic Preference Learning of LLMs