Advances in Language Model Alignment

The field of language model alignment is evolving rapidly, with a focus on more effective and efficient methods for improving the safety and helpfulness of large language models. Recent research highlights the importance of fine-tuning strategies, including Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), in achieving better alignment. Transparent alignment frameworks, such as Feature Steering with Reinforcement Learning (FSRL), have also shown promise for interpretable model control and for diagnosing the internal mechanisms of alignment. In addition, novel reward-shaping approaches, such as semantic reward modeling with encoder-only transformers, have improved the ability to align model outputs with complex, qualitative goals.

Noteworthy papers in this area include:

Improving LLM Safety and Helpfulness using SFT and DPO, which demonstrates the effectiveness of combining SFT with Direct Preference Optimization (DPO) for improving model alignment.

RL Fine-Tuning Heals OOD Forgetting in SFT, which identifies the rotation of singular vectors as the key mechanism behind the synergy of SFT and RL.

The Anatomy of Alignment, which introduces FSRL as a transparent alignment framework.

When Inverse Data Outperforms, which highlights the pitfalls of mixed data in multi-stage fine-tuning.

Shaping Explanations, which proposes reward shaping for Group Relative Policy Optimization (GRPO) using a small, efficient encoder-only transformer as a semantic reward model; a minimal sketch of this idea follows below.
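To make the last idea concrete, the sketch below pairs an encoder-only scorer with GRPO's group-relative advantages: each sampled explanation is rewarded by its embedding similarity to a reference, and rewards are standardized within the sampling group. The encoder choice (all-MiniLM-L6-v2), the cosine-similarity reward, and the helper names are illustrative assumptions, not the paper's exact recipe.

```python
# Hedged sketch: semantic rewards from a small encoder-only model,
# plus GRPO-style group-relative advantages. Model name, pooling, and
# the cosine-similarity reward are assumptions for illustration.
import torch
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # small encoder-only model (assumed choice)

def semantic_rewards(samples: list[str], reference: str) -> torch.Tensor:
    """Reward each sampled explanation by its embedding similarity to a reference."""
    embs = encoder.encode(samples + [reference],
                          convert_to_tensor=True,
                          normalize_embeddings=True)
    sample_embs, ref_emb = embs[:-1], embs[-1]
    return sample_embs @ ref_emb  # cosine similarity, since embeddings are unit-normalized

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages as in GRPO: standardize rewards within the sampling group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Usage: one prompt, a group of sampled explanations, one reference explanation.
group = [
    "The model refuses because the request violates its safety policy.",
    "Refusal is triggered by a keyword filter.",
    "Unrelated text with no explanation.",
]
adv = grpo_advantages(semantic_rewards(group, "The model refuses due to its safety policy."))
```

In an RL fine-tuning loop, these per-sample advantages would weight the policy-gradient update for each completion in the group, so explanations that are semantically closer to the reference are reinforced more strongly.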

Sources

Improving LLM Safety and Helpfulness using SFT and DPO: A Study on OPT-350M

RL Fine-Tuning Heals OOD Forgetting in SFT

The Anatomy of Alignment: Decomposing Preference Optimization by Steering Sparse Features

When Inverse Data Outperforms: Exploring the Pitfalls of Mixed Data in Multi-Stage Fine-Tuning

Shaping Explanations: Semantic Reward Modeling with Encoder-Only Transformers for GRPO
