Large Language Model Alignment and Optimization

Research on large language models (LLMs) is converging on more effective alignment and optimization techniques, with recent work focused on improving performance through post-training methods such as supervised fine-tuning and reinforcement learning fine-tuning. Notable advances include a better understanding of how these methods reshape model representations and out-of-distribution behavior, as well as new algorithms that mitigate issues such as reward hacking. A key direction is the analysis of preference data and its influence on direct preference optimization (DPO): studies indicate that the quality of chosen responses plays a dominant role in optimizing the DPO objective, whereas the quality of rejected responses has comparatively limited impact. New evaluation paradigms are also emerging, including hypothesis-based analysis frameworks that formulate preference alignment as a re-ranking process over hypothesis spaces. Noteworthy papers in this area include the following (a brief sketch of the DPO objective appears after the list):

  • Weights-Rotated Preference Optimization for Large Language Models, which proposes a novel algorithm to prevent the policy model from deviating too far from the reference model, thereby retaining its knowledge and expressive capabilities.
  • What Matters in Data for DPO?, which provides a systematic study of how the preference data distribution influences DPO performance.
  • HEAL: A Hypothesis-Based Preference-Aware Analysis Framework, which presents a novel evaluation paradigm for preference alignment and offers robust diagnostic tools for refining preference optimization methods.
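For context, the DPO objective that these studies analyze trains the policy to widen the implicit reward margin between chosen and rejected responses relative to a frozen reference model. The sketch below is a minimal, illustrative implementation assuming sequence-level log-probabilities have already been computed; the function name, argument names, and the beta value are placeholders for illustration and are not taken from any of the cited papers.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss from sequence-level log-probabilities.

    Each argument is a tensor of shape (batch,) holding the summed
    log-probability of the chosen / rejected response under the
    policy or the frozen reference model.
    """
    # Implicit rewards: log-probability ratios against the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Maximize the log-sigmoid of the chosen-vs-rejected reward margin.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Example with dummy log-probabilities for a batch of two preference pairs.
policy_chosen = torch.tensor([-12.3, -9.8])
policy_rejected = torch.tensor([-14.1, -9.5])
ref_chosen = torch.tensor([-12.0, -10.2])
ref_rejected = torch.tensor([-13.5, -9.9])
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```

Because the loss depends only on the margin between the two log-probability ratios, shifts in the quality of either the chosen or the rejected side change the gradient signal, which is why the data-distribution analyses above study the two sides separately.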

Sources

RL Is Neither a Panacea Nor a Mirage: Understanding Supervised vs. Reinforcement Learning Fine-Tuning for LLMs

Weights-Rotated Preference Optimization for Large Language Models

What Matters in Data for DPO?

HEAL: A Hypothesis-Based Preference-Aware Analysis Framework
