Advances in Preference Learning for Large Language Models

The field of Large Language Models (LLMs) is moving towards more robust and reliable alignment through new preference learning methods. Recent work targets well-known limitations of traditional Reinforcement Learning from Human Feedback (RLHF), such as sample inefficiency, annotation bias, and vulnerability to adversarial attacks. Newly proposed methods, including adversarial preference learning and dynamic target margins, aim to align LLMs with human preferences more reliably and report improved robustness and safety on standard alignment benchmarks. Noteworthy papers include Adversarial Preference Learning for Robust LLM Alignment, which introduces an iterative adversarial training method to improve robustness, and Robust Preference Optimization via Dynamic Target Margins, which augments preference optimization with a dynamically adjusted target margin.
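As a rough illustration of the margin idea, the sketch below adds a per-pair margin to a standard DPO-style objective in PyTorch. The function name margin_dpo_loss, the base_margin parameter, and the choice to scale the margin by the reference model's preference gap are assumptions made here for illustration; they are not taken from the paper, whose exact margin schedule may differ.

```python
import torch
import torch.nn.functional as F

def margin_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                    ref_chosen_logps, ref_rejected_logps,
                    beta=0.1, base_margin=0.5):
    """DPO-style loss with an illustrative per-pair target margin.

    The margin is scaled by the reference model's own preference gap; this
    scaling rule is an assumption for this sketch, not the paper's formula.
    """
    # Implicit rewards: scaled log-ratio of policy to reference likelihoods.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Illustrative dynamic margin: larger when the reference model already
    # separates the pair clearly, smaller for ambiguous pairs.
    ref_gap = (ref_chosen_logps - ref_rejected_logps).detach()
    margin = base_margin * torch.sigmoid(ref_gap)

    # Standard DPO objective with the margin subtracted before the sigmoid.
    logits = chosen_rewards - rejected_rewards - margin
    return -F.logsigmoid(logits).mean()


# Toy usage with random per-sequence log-probabilities (batch of 4).
if __name__ == "__main__":
    lp = lambda: torch.randn(4)
    loss = margin_dpo_loss(lp(), lp(), lp(), lp())
    print(loss.item())
```

In a real training loop the per-sequence log-probabilities would come from summing token-level log-probabilities of the policy and a frozen reference model over each chosen and rejected response.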

Sources

Adversarial Preference Learning for Robust LLM Alignment

Enhancing Paraphrase Type Generation: The Impact of DPO and RLHF Evaluated with Human-Ranked Data

BadReward: Clean-Label Poisoning of Reward Models in Text-to-Image RLHF

DenseDPO: Fine-Grained Temporal Preference Optimization for Video Diffusion Models

BPO: Revisiting Preference Modeling in Direct Preference Optimization

Robust Preference Optimization via Dynamic Target Margins

Crowd-SFT: Crowdsourcing for LLM Alignment

RIVAL: Reinforcement Learning with Iterative and Adversarial Optimization for Machine Translation

LeanPO: Lean Preference Optimization for Likelihood Alignment in Video-LLMs
