Advances in Reinforcement Learning from Human Feedback

The field of reinforcement learning from human feedback (RLHF) is advancing rapidly, with a focus on developing more efficient and robust algorithms. Recent research has explored novel tools, such as intuitionistic fuzzy sets and symmetric losses, to improve the quality and reliability of human preference data. There is also growing interest in algorithms that tolerate noisy or uncertain preference data, as well as methods that operate without requiring a known link function. These developments have the potential to significantly improve both the performance of large language models and their alignment with human intent.

Notable papers in this area include:

- Thompson Sampling in Online RLHF with General Function Approximation: proposes a model-free posterior sampling algorithm with theoretical guarantees.
- Intuitionistic Fuzzy Sets for Large Language Model Data Annotation: introduces a novel framework for modeling and aggregating human preferences.
- On Symmetric Losses for Robust Policy Optimization with Noisy Preferences: proposes a principled framework for robust policy optimization under noisy preferences.
- Stochastically Dominant Peer Prediction: proposes a new peer prediction mechanism that incentivizes truthful reporting.
- Provable Reinforcement Learning from Human Feedback with an Unknown Link Function: proposes a policy optimization algorithm that does not require knowing the link function.
- UnHiPPO: Uncertainty-aware Initialization for State Space Models: extends HiPPO theory to account for measurement noise and derives an uncertainty-aware initialization for state space model dynamics.
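As background on the symmetric-loss idea mentioned above, the sketch below illustrates the general robustness property (not necessarily the exact formulation of the cited paper): a loss on the preference margin is "symmetric" when its values at z and -z sum to a constant, which makes the noisy risk under random preference flips an affine transform of the clean risk, leaving the minimizer unchanged.

```python
import numpy as np

# Preference learning fits a reward margin z = r(winner) - r(loser).
# A loss l is "symmetric" if l(z) + l(-z) is constant for all z; such
# losses are known to be robust to symmetric (random-flip) label noise.

def logistic_loss(z):
    """Standard Bradley-Terry-style loss: -log sigmoid(z). Not symmetric."""
    return np.log1p(np.exp(-z))

def sigmoid_loss(z):
    """Sigmoid loss: sigmoid(-z). Symmetric, since it sums with its flip to 1."""
    return 1.0 / (1.0 + np.exp(z))

z = np.linspace(-5.0, 5.0, 101)

# Symmetry holds for the sigmoid loss but not for the logistic loss.
assert np.allclose(sigmoid_loss(z) + sigmoid_loss(-z), 1.0)
assert not np.allclose(logistic_loss(z) + logistic_loss(-z),
                       2.0 * logistic_loss(0.0))

# With flip rate rho, the noisy risk of a symmetric loss is an affine
# transform of the clean risk, so its minimizer is unchanged:
rho = 0.2
noisy = (1.0 - rho) * sigmoid_loss(z) + rho * sigmoid_loss(-z)
assert np.allclose(noisy, rho + (1.0 - 2.0 * rho) * sigmoid_loss(z))
```

The affine-risk identity in the last assertion is the core reason symmetric losses tolerate flipped preference labels, whereas the logistic loss lacks this guarantee.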

Sources

Thompson Sampling in Online RLHF with General Function Approximation

Intuitionistic Fuzzy Sets for Large Language Model Data Annotation: A Novel Approach to Side-by-Side Preference Labeling

On Symmetric Losses for Robust Policy Optimization with Noisy Preferences

Stochastically Dominant Peer Prediction

Provable Reinforcement Learning from Human Feedback with an Unknown Link Function

UnHiPPO: Uncertainty-aware Initialization for State Space Models
