Advances in Aligning Language Models with Human Preferences

Natural language processing research is moving toward more sophisticated methods for aligning language models with human preferences. Recent work focuses on improving the tradeoff between expected reward and the probability of undesired outputs, and on making language models more reliable and robust. Notable advances include new training methods and the application of explainable AI techniques to improve model transparency and trustworthiness. There is also growing interest in using large language models as in-context meta-learners for model and hyperparameter selection, and in tracing value alignment during post-training. Noteworthy papers include RePULSe, which introduces a training method that reduces the probability of undesired outputs, and Assessing the Real-World Utility of Explainable AI for Arousal Diagnostics, an application-grounded user study of transparent AI assistance in clinical workflows.
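
To make the reward/undesired-output tradeoff concrete, one generic way to formalize it is a penalized expected-reward objective. This is an illustrative sketch, not the specific objective used by RePULSe or any other paper listed below; the symbols $\pi_\theta$, $r$, $\mathcal{U}$, and $\lambda$ are introduced here only for illustration:

$$
\max_{\theta}\; \mathbb{E}_{x \sim \pi_\theta}\!\left[ r(x) \right] \;-\; \lambda \, \Pr_{x \sim \pi_\theta}\!\left[ x \in \mathcal{U} \right]
$$

where $\pi_\theta$ is the language model's output distribution, $r$ is a reward model, $\mathcal{U}$ is a set of undesired outputs, and $\lambda > 0$ controls how strongly the probability of undesired outputs is weighted against expected reward.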

Sources

Reducing the Probability of Undesirable Outputs in Language Models Using Probabilistic Inference

Assessing the Real-World Utility of Explainable AI for Arousal Diagnostics: An Application-Grounded User Study

PREFINE: Personalized Story Generation via Simulated User Critics and User-Specific Rubric Generation

Distribution Shift Alignment Helps LLMs Simulate Survey Response Distributions

Generalization or Memorization: Dynamic Decoding for Mode Steering

Complementary Human-AI Clinical Reasoning in Ophthalmology

CHOIR: Collaborative Harmonization fOr Inference Robustness

Interpreting and Mitigating Unwanted Uncertainty in LLMs

Offline Preference Optimization via Maximum Marginal Likelihood Estimation

ProfileXAI: User-Adaptive Explainable AI

Lightweight Robust Direct Preference Optimization

Decomposition-Enhanced Training for Post-Hoc Attributions In Language Models

LISTEN to Your Preferences: An LLM Framework for Multi-Objective Selection

Ideology-Based LLMs for Content Moderation

Approximating Human Preferences Using a Multi-Judge Learned System

What's In My Human Feedback? Learning Interpretable Descriptions of Preference Data

Test-Time Alignment of LLMs via Sampling-Based Optimal Control in pre-logit space

LLMs as In-Context Meta-Learners for Model and Hyperparameter Selection

Value Drifts: Tracing Value Alignment During LLM Post-Training
