Advances in Aligning Large Language Models with Human Preferences

The field of large language models (LLMs) is evolving rapidly, with a growing focus on ensuring that these models align with human values and intentions. Recent research has explored a range of alignment techniques, including preference-based methods, supervised fine-tuning, and direct preference optimization, each balancing trade-offs between core objectives such as instruction-following and capturing nuanced human intent. Two noteworthy papers in this area are PITA, which introduces a framework for integrating preference feedback directly into LLM token generation, eliminating the need for a pre-trained reward model, and MaPPO, which proposes a preference-learning framework that explicitly incorporates prior reward knowledge into the optimization objective, improving alignment and mitigating the oversimplified binary classification of responses.
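For context, the sketch below shows the standard direct preference optimization (DPO) loss that preference-optimization methods such as MaPPO build on; it is an illustrative example only, not the exact objective from either paper (MaPPO's maximum a posteriori formulation and PITA's inference-time guidance are described in their respective papers). All function and variable names here are assumptions for illustration.

```python
# Illustrative sketch of the standard DPO objective underlying preference optimization.
# Assumes per-example log-probabilities already summed over response tokens.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss for a batch of preference pairs (chosen vs. rejected responses)."""
    # Implicit rewards: beta * log(pi_theta(y|x) / pi_ref(y|x)) for each response.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry style objective: push the chosen reward above the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

Methods like MaPPO extend this kind of pairwise objective with prior reward information rather than relying on the purely binary chosen/rejected signal.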

Sources

Alignment and Safety in Large Language Models: Safety Mechanisms, Training Paradigms, and Emerging Challenges

Efficient Learning for Product Attributes with Compact Multimodal Models

PITA: Preference-Guided Inference-Time Alignment for LLM Post-Training

Learning to Align Human Code Preferences

SGPO: Self-Generated Preference Optimization based on Self-Improver

The Value of Gen-AI Conversations: A bottom-up Framework for AI Value Alignment

MaPPO: Maximum a Posteriori Preference Optimization with Prior Knowledge

UnsafeChain: Enhancing Reasoning Model Safety via Hard Cases
