Advances in Aligning Large Language Models with Human Preferences

The field of large language models (LLMs) is rapidly advancing, with a growing focus on aligning these models with human preferences and values. Recent work emphasizes methods that incorporate human feedback directly into the training process, and a central line of innovation is the design of reward models that capture human preferences accurately. Proposed approaches include collaborative reward modeling, multi-objective preference optimization, and preference learning with lie detectors, all aimed at making LLMs more robust and reliable so that they produce more accurate and helpful responses. The release of datasets such as HelpSteer3-Preference provides a valuable resource for training and evaluating these models, while reinforcement learning from user feedback offers a promising route to aligning LLMs with real-world user preferences. Overall, the field is moving toward more sophisticated, human-centered approaches to LLM development, with an emphasis on safety, fairness, and transparency.

Noteworthy papers include Collaborative Reward Modeling, which combines peer review and curriculum learning to improve reward-model robustness; Multi-Objective Preference Optimization, which introduces an algorithm for balancing multiple objectives during preference alignment; and Preference Learning with Lie Detectors, which examines how incorporating lie detectors into preference learning can induce either honesty or evasion.
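The pairwise reward-modeling objective that much of this work builds on can be made concrete with a short sketch. The snippet below is a minimal, illustrative Bradley-Terry preference loss in PyTorch, not the method of any specific paper listed here; the `RewardModel` head and the random feature tensors are placeholders standing in for a pretrained LLM encoder and real pooled representations of chosen and rejected responses.

```python
# Minimal sketch of a pairwise (Bradley-Terry) reward-modeling objective.
# The linear "reward model" and random features are illustrative placeholders,
# not the architecture or data of any paper cited in this digest.
import torch
import torch.nn as nn


class RewardModel(nn.Module):
    def __init__(self, hidden_dim: int):
        super().__init__()
        # In practice this scalar head sits on top of a pretrained LLM encoder.
        self.scorer = nn.Linear(hidden_dim, 1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, hidden_dim) pooled representation of a response.
        return self.scorer(features).squeeze(-1)


def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry objective: maximize P(chosen > rejected),
    # i.e. minimize -log sigmoid(r_chosen - r_rejected), averaged over the batch.
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()


if __name__ == "__main__":
    torch.manual_seed(0)
    model = RewardModel(hidden_dim=16)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

    # Stand-ins for pooled LLM features of chosen/rejected responses.
    chosen = torch.randn(8, 16)
    rejected = torch.randn(8, 16)

    optimizer.zero_grad()
    loss = preference_loss(model(chosen), model(rejected))
    loss.backward()
    optimizer.step()
    print(f"pairwise preference loss: {loss.item():.4f}")
```

The papers above differ mainly in how they extend this basic objective, for example by aggregating multiple annotators or objectives, or by adding auxiliary signals such as lie-detector outputs.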

Sources

Two Minds Better Than One: Collaborative Reward Modeling for LLM Alignment

A Systematic Analysis of Base Model Choice for Reward Modeling

Multi-Objective Preference Optimization: Improving Human Alignment of Generative Models

MPMA: Preference Manipulation Attack Against Model Context Protocol

HelpSteer3-Preference: Open Human-Annotated Preference Data across Diverse Tasks and Languages

Detecting Prefix Bias in LLM-based Reward Models

Preference Learning with Lie Detectors can Induce Honesty or Evasion

Truth or Twist? Optimal Model Selection for Reliable Label Flipping Evaluation in LLM-based Counterfactuals

Pantheon: Personalized Multi-objective Ensemble Sort via Iterative Pareto Policy Optimization

Is Your Prompt Safe? Investigating Prompt Injection Attacks Against Open-Source LLMs

Reinforcement Learning from User Feedback

DISCO Balances the Scales: Adaptive Domain- and Difficulty-Aware Reinforcement Learning on Imbalanced Data

Generalised Probabilistic Modelling and Improved Uncertainty Estimation in Comparative LLM-as-a-judge

Improving LLM First-Token Predictions in Multiple-Choice Question Answering via Prefilling Attack

AI vs. Human Judgment of Content Moderation: LLM-as-a-Judge and Ethics-Based Response Refusals

LFTF: Locating First and Then Fine-Tuning for Mitigating Gender Bias in Large Language Models

A Unified Theoretical Analysis of Private and Robust Offline Alignment: from RLHF to DPO

Reverse Engineering Human Preferences with Reinforcement Learning

Aligning Dialogue Agents with Global Feedback via Large Language Model Reward Decomposition

SLMEval: Entropy-Based Calibration for Human-Aligned Evaluation of Large Language Models

Your Pre-trained LLM is Secretly an Unsupervised Confidence Calibrator

MPO: Multilingual Safety Alignment via Reward Gap Optimization
