Advances in Unlearning and Preference Learning for Large Language Models
Research on large language models is moving toward more robust and responsible systems, with particular attention to unlearning and preference learning. Recent work highlights the importance of removing unwanted knowledge from a model while preserving its overall performance, which has driven new techniques such as variational inference frameworks and activation steering that make unlearning efficient and effective. Interest in preference learning is also growing, with approaches that anchor training on real-world dissatisfaction signals and sample positives dynamically from the evolving policy. Together, these advances stand to improve the safety and reliability of large language models.
Noteworthy papers include:
DRIFT, which introduces a dissatisfaction-refined iterative preference training method that achieves state-of-the-art results on the WildBench and AlpacaEval2 benchmarks.
Latent Diffusion Unlearning, which proposes a model-based perturbation strategy operating in the latent space of diffusion models to protect against unauthorized personalization.
MLLMEraser, which enables test-time unlearning in multimodal large language models through activation steering (a minimal sketch of the general idea follows this list).
Distribution Preference Optimization, which derives an unlearning algorithm that targets the next-token probability distribution rather than entire responses.
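To make the activation-steering idea concrete, the PyTorch sketch below shows the generic pattern behind test-time steering: a direction estimated from "forget" versus "retain" activations is subtracted from a layer's hidden states through a forward hook, with no weight updates. The toy module, dimensions, scaling factor, and the way the steering direction is estimated are illustrative assumptions, not MLLMEraser's actual procedure.

```python
import torch
import torch.nn as nn

# Toy stand-in for one hidden layer of a (multimodal) language model.
class ToyBlock(nn.Module):
    def __init__(self, d_model=16):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        return torch.relu(self.proj(x))

def make_steering_hook(steering_vector, alpha=4.0):
    """Forward hook that shifts hidden states away from the target direction,
    suppressing the associated concept at inference time only."""
    def hook(module, inputs, output):
        return output - alpha * steering_vector  # returned value replaces the layer output
    return hook

d_model = 16
block = ToyBlock(d_model)

# Hypothetical steering direction: difference between mean activations on
# forget-set and retain-set inputs (random stand-ins here for illustration).
forget_acts = torch.randn(32, d_model)
retain_acts = torch.randn(32, d_model)
steering_vector = forget_acts.mean(0) - retain_acts.mean(0)
steering_vector = steering_vector / steering_vector.norm()

handle = block.register_forward_hook(make_steering_hook(steering_vector))

x = torch.randn(2, 5, d_model)   # (batch, tokens, hidden)
steered = block(x)               # hidden states shifted away from the erased concept
handle.remove()                  # removing the hook restores the original behavior
print(steered.shape)
```

Because the intervention lives in a removable hook rather than in the weights, the original model is recovered simply by detaching the hook, which is what makes this style of steering attractive for test-time unlearning.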
Sources
Latent Diffusion Unlearning: Protecting Against Unauthorized Personalization Through Trajectory Shifted Perturbations
Variational Diffusion Unlearning: A Variational Inference Framework for Unlearning in Diffusion Models under Data Constraints