Advances in Large Language Model Alignment and Robustness

The field of large language model (LLM) research is advancing rapidly, with a focus on improving alignment with human preferences and robustness to spurious correlations. Recent studies investigate a range of fine-tuning techniques, both supervised and preference-based, to improve LLM performance and generalization. Distributional robustness in reward models has also drawn attention, with proposals such as batch-wise sum-to-zero regularization and direct density ratio optimization. In addition, new algorithms for privacy-preserving alignment achieve state-of-the-art performance under differential privacy guarantees. Noteworthy papers include:

  • PARM, which proposes a unified preference-aware Autoregressive Reward Model for multi-objective test-time alignment, reducing inference costs and improving alignment with preference vectors.
  • On the Robustness of Reward Models for Language Model Alignment, which introduces batch-wise sum-to-zero regularization to improve the distributional robustness of reward models (a minimal sketch follows this list).
  • Direct Density Ratio Optimization, which provides a statistically consistent approach to aligning LLMs with human preferences.
  • InfoPO, which eliminates the reliance on the Bradley-Terry model and prevents overfitting in preference fine-tuning.
  • Improved Algorithms for Differentially Private Language Model Alignment, which achieves state-of-the-art performance in privacy-preserving alignment.
  • WorldPM, which proposes a unified representation of human preferences and demonstrates scalability potential in preference modeling.
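
To make the batch-wise sum-to-zero idea concrete, the following is a minimal sketch that adds such a penalty to a standard Bradley-Terry reward-model objective in PyTorch. The penalty form, the `bsr_weight` hyperparameter, and the function name are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch (not the paper's code): Bradley-Terry reward-model
# loss with a batch-wise sum-to-zero regularizer on the batch reward sum.
import torch
import torch.nn.functional as F

def reward_loss_with_bsr(chosen_rewards: torch.Tensor,
                         rejected_rewards: torch.Tensor,
                         bsr_weight: float = 0.01) -> torch.Tensor:
    # Standard Bradley-Terry objective: the chosen response should
    # receive a higher scalar reward than the rejected one.
    bt_loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

    # Batch-wise sum-to-zero regularization (assumed form): penalize the
    # squared sum of all rewards in the batch so their mean stays near zero.
    batch_rewards = torch.cat([chosen_rewards, rejected_rewards])
    bsr_penalty = batch_rewards.sum().pow(2) / batch_rewards.numel()

    return bt_loss + bsr_weight * bsr_penalty
```

In this sketch the regularizer discourages the batch reward sum from drifting away from zero, one way to keep reward magnitudes calibrated across batches; the paper's exact formulation and weighting may differ.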

Sources

Assessing Robustness to Spurious Correlations in Post-Training Language Models

PARM: Multi-Objective Test-Time Alignment via Preference-Aware Autoregressive Reward Model

User Behavior Analysis in Privacy Protection with Large Language Models: A Study on Privacy Preferences with Limited Data

On the Robustness of Reward Models for Language Model Alignment

Direct Density Ratio Optimization: A Statistically Consistent Approach to Aligning Large Language Models

RepCali: High Efficient Fine-tuning Via Representation Calibration in Latent Space for Pre-trained Language Models

InfoPO: On Mutual Information Maximization for Large Language Model Alignment

Improved Algorithms for Differentially Private Language Model Alignment

WorldPM: Scaling Human Preference Modeling
