Advances in Large Language Model Alignment and Optimization

The field of large language models (LLMs) is advancing rapidly, with a strong focus on alignment and optimization techniques. Recent work centers on enabling LLMs to incorporate human values and preferences, with approaches such as survey-to-behavior alignment and reward-guided decoding showing promise. There is also growing interest in multimodal LLMs, explored through techniques such as input-dependent steering, alongside multi-objective alignment via value-guided inference-time search. Noteworthy papers include 'Survey-to-Behavior: Downstream Alignment of Human Values in LLMs via Survey Questions', which demonstrates that fine-tuning LLMs on value surveys changes their downstream behavior, and 'Controlling Multimodal LLMs via Reward-guided Decoding', which steers the decoding of a multimodal LLM with a reward model to improve its visual grounding. Together, these advances stand to improve both the performance and the safety of LLMs across a wide range of applications.
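To make the reward-guided decoding idea concrete, the sketch below illustrates the general recipe of biasing token selection with a reward signal at inference time. It is a minimal, self-contained toy, not the method from the cited paper: the vocabulary, the uniform "language model", the keyword-based reward, and the weighting parameter beta are all hypothetical stand-ins for a real LLM and a learned reward model.

```python
import math
import random

# Toy vocabulary standing in for an LLM's token set (hypothetical illustration).
VOCAB = ["the", "cat", "sat", "on", "mat", "<eos>"]

def lm_logprobs(prefix):
    """Toy 'language model': uniform log-probability over the vocabulary."""
    logp = math.log(1.0 / len(VOCAB))
    return {tok: logp for tok in VOCAB}

def reward(prefix):
    """Toy reward: prefers outputs containing 'cat' (stands in for a learned reward model)."""
    return 1.0 if "cat" in prefix else 0.0

def reward_guided_step(prefix, beta=2.0):
    """Combine LM log-probs with the reward of each candidate continuation, then sample."""
    scores = {
        tok: logp + beta * reward(prefix + [tok])
        for tok, logp in lm_logprobs(prefix).items()
    }
    # Softmax over the combined scores to get a sampling distribution.
    m = max(scores.values())
    weights = {tok: math.exp(s - m) for tok, s in scores.items()}
    total = sum(weights.values())
    probs = [weights[tok] / total for tok in VOCAB]
    return random.choices(VOCAB, weights=probs, k=1)[0]

def decode(max_len=8, beta=2.0):
    """Generate a sequence token by token under the reward-shifted distribution."""
    prefix = []
    for _ in range(max_len):
        tok = reward_guided_step(prefix, beta)
        if tok == "<eos>":
            break
        prefix.append(tok)
    return " ".join(prefix)

if __name__ == "__main__":
    random.seed(0)
    print(decode())
```

In a real system, lm_logprobs would come from the model's next-token logits and reward would be a separate scoring model; beta controls how strongly decoding is pulled toward high-reward continuations.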

Sources

Minimizing Surrogate Losses for Decision-Focused Learning using Differentiable Optimization

Survey-to-Behavior: Downstream Alignment of Human Values in LLMs via Survey Questions

Controlling Multimodal LLMs via Reward-guided Decoding

Large Language Models Enable Personalized Nudges to Promote Carbon Offsetting Among Air Travellers

J6: Jacobian-Driven Role Attribution for Multi-Objective Prompt Optimization in LLMs

RLNVR: Reinforcement Learning from Non-Verified Real-World Rewards

M3PO: Multimodal-Model-Guided Preference Optimization for Visual Instruction Following

Reinforcement Learning with Rubric Anchors

Learning to Steer: Input-dependent Steering for Multimodal LLMs

MAVIS: Multi-Objective Alignment via Value-Guided Inference-Time Search

Chunks as Arms: Multi-Armed Bandit-Guided Sampling for Long-Context LLM Preference Optimization

PersRM-R1: Enhance Personalized Reward Modeling with Reinforcement Learning

DuPO: Enabling Reliable LLM Self-Verification via Dual Preference Optimization

Linear Preference Optimization: Decoupled Gradient Control via Absolute Regularization

Improving LLMs for Machine Translation Using Synthetic Preference Data

Reward-Shifted Speculative Sampling Is An Efficient Test-Time Weak-to-Strong Aligner

Subjective Behaviors and Preferences in LLM: Language of Browsing
