Advances in Preference-Aware Language Models

The field of language models is moving toward more sophisticated and nuanced models that can capture individual and group-level preferences. Researchers are exploring several approaches to improve user-satisfaction estimation, including preference-adaptive reinforcement learning and expectation-maximization-based clustering algorithms. Another area of focus is the interpretability of neural networks, where techniques such as sparse autoencoders and mechanistic interpretability are being developed. There is also growing interest in multilingual preference optimization, with methods proposed to improve robustness to noisy or low-margin comparisons. Noteworthy papers include CAPO, which proposes a dynamic loss-scaling mechanism to improve preference optimization in multilingual settings; SCALAR, which introduces a benchmark for measuring interaction sparsity between sparse autoencoder features and proposes a new architecture, Staircase SAEs, to improve relative sparsity; AMaPO, which resolves the overfitting-underfitting dilemma in offline preference optimization by employing an instance-wise adaptive margin; C$^3$TG, which proposes a two-phase framework for fine-grained, multi-dimensional text attribute control; and SparseRM, which leverages sparse autoencoders to extract preference-relevant information and construct a lightweight, interpretable reward model.
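To make the sparse-autoencoder idea concrete, here is a minimal sketch (not the SCALAR or SparseRM architecture, whose details are not given here): an SAE maps an activation vector to an overcomplete, non-negative code via a ReLU encoder, reconstructs the input with a linear decoder, and trains on reconstruction error plus an L1 sparsity penalty. All sizes and names below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_hidden = 8, 32          # toy sizes; real SAEs use a much larger d_hidden
W_enc = rng.normal(scale=0.1, size=(d_model, d_hidden))
b_enc = np.zeros(d_hidden)
W_dec = rng.normal(scale=0.1, size=(d_hidden, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x, l1_coef=1e-3):
    """One forward pass: sparse code, reconstruction, and the training loss."""
    z = np.maximum(x @ W_enc + b_enc, 0.0)   # ReLU yields non-negative, mostly-zero codes
    x_hat = z @ W_dec + b_dec                # linear decoder reconstructs the activation
    recon = np.mean((x - x_hat) ** 2)        # reconstruction error
    sparsity = l1_coef * np.abs(z).mean()    # L1 penalty encourages few active features
    return z, x_hat, recon + sparsity
```

The sparse code `z` is what interpretability work (and SparseRM-style reward modeling) reads features from; the L1 term is what drives most entries of `z` to zero.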
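Margin-based preference optimization, the family AMaPO belongs to, can likewise be sketched in a few lines. The version below is a generic Bradley-Terry (DPO-style) pairwise loss with an additive margin term, not AMaPO's actual instance-wise scheme; the function name and all values are illustrative. A larger margin demands that the chosen response beat the rejected one by a wider gap before the loss saturates.

```python
import math

def margin_preference_loss(logp_chosen, logp_rejected,
                           ref_logp_chosen, ref_logp_rejected,
                           beta=0.1, margin=0.0):
    """DPO-style pairwise loss with an additive margin (illustrative sketch).

    Rewards are the policy's log-prob shift relative to a frozen reference
    model, scaled by beta; the margin is subtracted before the sigmoid.
    """
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    z = chosen_reward - rejected_reward - margin
    # -log(sigmoid(z)) in a numerically stable form
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

An adaptive variant would set `margin` per training instance (e.g. from how confidently the pair is ranked) rather than using one global constant.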

Sources

Minority-Aware Satisfaction Estimation in Dialogue Systems via Preference-Adaptive Reinforcement Learning

SCALAR: Benchmarking SAE Interaction Sparsity in Toy LLMs

CAPO: Confidence Aware Preference Optimization Learning for Multilingual Preferences

SparseRM: A Lightweight Preference Modeling with Sparse Autoencoder

Diverse Preference Learning for Capabilities and Alignment

Cost-Minimized Label-Flipping Poisoning Attack to LLM Alignment

C$^3$TG: Conflict-aware, Composite, and Collaborative Controlled Text Generation

AMaPO: Adaptive Margin-attached Preference Optimization for Language Model Alignment
