Advances in Unlearning and Preference Learning for Large Language Models

The field of large language models is moving toward more robust and responsible systems, with a particular focus on unlearning and preference learning. Recent work stresses removing unwanted knowledge from models while preserving their overall performance, which has led to new techniques such as variational inference frameworks and activation steering that make unlearning more efficient and effective. In parallel, interest in preference learning is growing, with approaches that anchor training on real-world dissatisfaction signals and sample positives dynamically from the evolving policy. Together, these directions stand to improve the safety and reliability of large language models.

Noteworthy papers include:

DRIFT, a dissatisfaction-refined iterative preference training method that anchors on real-world user dissatisfaction signals and achieves state-of-the-art results on the WildBench and AlpacaEval2 benchmarks (a loss sketch follows below).

Latent Diffusion Unlearning, a model-based perturbation strategy that operates within the latent space of diffusion models to protect against unauthorized personalization.

MLLMEraser, which enables test-time unlearning in multimodal large language models through activation steering (sketched below).

Distribution Preference Optimization, an unlearning algorithm that targets the next-token probability distribution rather than entire responses (sketched below).
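To make the DRIFT idea concrete, here is a minimal, hypothetical sketch of a preference loss in which the rejected side is a logged dissatisfied response and the accepted side is freshly sampled from the current policy. The DPO-style pairing, the name drift_pair_loss, and the placeholder log-probabilities are illustrative assumptions, not the paper's exact objective.

```python
# Illustrative sketch of DRIFT-style preference pairing: the negative comes
# from logged user dissatisfaction, the positive is sampled from the
# current (evolving) policy. Scalar scores stand in for sequence log-probs;
# the paper's actual sampling and filtering procedure is not shown.
import torch
import torch.nn.functional as F

def drift_pair_loss(logp_pos_policy, logp_pos_ref,
                    logp_neg_policy, logp_neg_ref, beta=0.1):
    """DPO-style loss on a (sampled positive, dissatisfaction negative) pair."""
    margin = beta * ((logp_pos_policy - logp_pos_ref)
                     - (logp_neg_policy - logp_neg_ref))
    return -F.logsigmoid(margin)

# Placeholder sequence log-probabilities for one training pair.
loss = drift_pair_loss(
    logp_pos_policy=torch.tensor(-4.2, requires_grad=True),  # fresh policy sample
    logp_pos_ref=torch.tensor(-5.0),
    logp_neg_policy=torch.tensor(-3.8, requires_grad=True),  # logged dissatisfied reply
    logp_neg_ref=torch.tensor(-3.5),
)
loss.backward()
```

Sampling the positive from the evolving policy keeps the preference margin anchored to the model's current behavior rather than to a static dataset of preferred responses.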
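The activation-steering mechanism behind MLLMEraser can be illustrated with a short PyTorch sketch: an erasure direction is estimated from forget/retain activation statistics and ablated from a layer's output at inference time, with no weight updates. The difference-of-means direction, the toy layer, and all names here are assumptions for illustration; the paper's actual construction may differ.

```python
# Hypothetical sketch of activation steering for test-time unlearning.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for one transformer block's hidden states.
hidden_dim = 16
layer = nn.Linear(hidden_dim, hidden_dim)

# Assumed: activations collected on "forget" vs. "retain" prompts.
forget_acts = torch.randn(32, hidden_dim) + 2.0  # placeholder statistics
retain_acts = torch.randn(32, hidden_dim)

# Steering direction: normalized difference of mean activations.
direction = forget_acts.mean(0) - retain_acts.mean(0)
direction = direction / direction.norm()

def erase_hook(module, inputs, output, alpha=1.0):
    # Remove the component of the activation along the erasure direction.
    proj = (output @ direction).unsqueeze(-1) * direction
    return output - alpha * proj

handle = layer.register_forward_hook(erase_hook)

x = torch.randn(4, hidden_dim)
steered = layer(x)  # activations with the forget direction ablated
handle.remove()
```

Because the edit happens in a forward hook at inference, the underlying weights stay intact, which is what makes this a test-time rather than a training-time unlearning mechanism.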
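Finally, a hedged sketch of what a distribution-level unlearning objective can look like, in the spirit of Distribution Preference Optimization: rather than penalizing whole responses, the loss acts per position on the next-token distribution, pushing it away from a frozen reference only where the unwanted knowledge appears. The KL-based formulation and the forget_mask are illustrative assumptions, not the paper's derived objective.

```python
# Hedged sketch of a distribution-level unlearning loss defined over
# next-token probability distributions rather than entire responses.
import torch
import torch.nn.functional as F

vocab, seq = 50, 8
logits_model = torch.randn(seq, vocab, requires_grad=True)  # policy being unlearned
logits_ref = torch.randn(seq, vocab)                        # frozen reference model
forget_mask = torch.tensor([0, 0, 1, 1, 0, 0, 1, 0], dtype=torch.bool)

log_p = F.log_softmax(logits_model, dim=-1)
log_q = F.log_softmax(logits_ref, dim=-1)

# Per-position KL(model || reference) over the next-token distribution.
kl = (log_p.exp() * (log_p - log_q)).sum(-1)

# Push away from the reference only at positions carrying the unwanted
# knowledge; stay close to it everywhere else to preserve performance.
loss = -kl[forget_mask].mean() + kl[~forget_mask].mean()
loss.backward()
```

Operating on per-position distributions gives a finer-grained handle than response-level objectives: tokens unrelated to the forgotten content are explicitly pulled back toward the reference model.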

Sources

DRIFT: Learning from Abundant User Dissatisfaction in Real-World Preference Learning

Latent Diffusion Unlearning: Protecting Against Unauthorized Personalization Through Trajectory Shifted Perturbations

Memory Self-Regeneration: Uncovering Hidden Knowledge in Unlearned Models

Revoking Amnesia: RL-based Trajectory Optimization to Resurrect Erased Concepts in Diffusion Models

Variational Diffusion Unlearning: A Variational Inference Framework for Unlearning in Diffusion Models under Data Constraints

MLLMEraser: Achieving Test-Time Unlearning in Multimodal Large Language Models through Activation Steering

Distribution Preference Optimization: A Fine-grained Perspective for LLM Unlearning
