Advances in Mitigating Memorization and Bias in Large Language Models

Research on large language models (LLMs) is increasingly focused on critical concerns around memorization, privacy, and bias, and on new paradigms and frameworks that promote fairness and transparency. A key direction is the isolation of memorized content, making it easier to remove without compromising general language capabilities. Another is the identification of memorized personal data, which enables the dynamic construction of forget sets for machine unlearning and right-to-be-forgotten requests. There is also growing interest in guiding LLM decision-making with fairness reward models, which can down-weight biased reasoning trajectories and favor equitable ones. Illustrative sketches of the latter two ideas follow the paper list below. Noteworthy papers in this area include:

  • A study introducing MemSinks, a new paradigm that isolates memorized content during training so it can be removed more easily;
  • Research presenting a model-agnostic metric for quantifying human-fact associations in LLMs;
  • A framework for training a generalizable Fairness Reward Model that enables trustworthy use of reasoning models in high-stakes decision-making.
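
To make the forget-set idea concrete, the sketch below dynamically builds a forget set from right-to-be-forgotten requests. It is a minimal illustration under stated assumptions, not the paper's method: the token-overlap `association_score` and the `toy_generate` model are stand-ins for a proper model-agnostic memorization metric and a real LLM.

```python
"""Minimal sketch: dynamically building a forget set for machine unlearning.

Assumptions (not from the paper): `generate` is any callable mapping a prompt
to model text, and the association score is a crude token-overlap heuristic
standing in for a proper model-agnostic memorization metric.
"""
from dataclasses import dataclass
from typing import Callable


@dataclass
class ForgetRequest:
    subject: str   # person named in a right-to-be-forgotten request
    fact: str      # personal data that may have been memorized


def association_score(generate: Callable[[str], str], subject: str, fact: str) -> float:
    """Fraction of the fact's tokens the model reproduces when prompted
    about the subject -- a rough proxy for memorization."""
    completion = generate(f"Tell me everything you know about {subject}.").lower()
    fact_tokens = fact.lower().split()
    hits = sum(token in completion for token in fact_tokens)
    return hits / max(len(fact_tokens), 1)


def build_forget_set(generate, requests, threshold=0.6):
    """Keep only the requests the model appears to have memorized, so the
    unlearning step targets actual leakage rather than every request."""
    return [r for r in requests if association_score(generate, r.subject, r.fact) >= threshold]


if __name__ == "__main__":
    # Dummy "model" for demonstration: pretends to have memorized one person.
    def toy_generate(prompt: str) -> str:
        return "Jane Doe lives at 12 Elm Street." if "Jane Doe" in prompt else "I have no information."

    requests = [ForgetRequest("Jane Doe", "lives at 12 Elm Street"),
                ForgetRequest("John Roe", "works at Acme Corp")]
    print(build_forget_set(toy_generate, requests))  # only the Jane Doe entry survives
```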
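
Similarly, the following sketch shows one way a fairness reward could re-weight candidate reasoning trajectories, down-weighting biased ones and favoring equitable ones. The keyword-based `fairness_score` is a placeholder assumption; the Fairness Reward Model described in the paper is a trained, generalizable model.

```python
"""Minimal sketch: re-weighting candidate trajectories with a fairness reward.

Assumption (not from the paper): `fairness_score` is a stub keyword check
returning a value in [0, 1]; a real fairness reward model would be learned.
"""
import math


def fairness_score(trajectory: str) -> float:
    """Stub reward: penalize trajectories that condition on protected attributes."""
    protected = ("gender", "race", "age", "religion")
    return 0.1 if any(term in trajectory.lower() for term in protected) else 0.9


def reweight(trajectories, base_scores, alpha=1.0):
    """Combine the generator's own scores with fairness rewards via a softmax,
    down-weighting biased trajectories and favoring equitable ones."""
    logits = [b + alpha * fairness_score(t) for t, b in zip(trajectories, base_scores)]
    z = sum(math.exp(l) for l in logits)
    return [math.exp(l) / z for l in logits]


if __name__ == "__main__":
    candidates = [
        "Approve the loan because the applicant's income covers repayments.",
        "Deny the loan because applicants of this age group default more often.",
    ]
    weights = reweight(candidates, base_scores=[0.5, 0.7])
    best = candidates[max(range(len(weights)), key=weights.__getitem__)]
    print(weights, "->", best)  # the equitable trajectory receives the higher weight
```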

Sources

Memorization Sinks: Isolating Memorization during LLM Training

What Should LLMs Forget? Quantifying Personal Data in LLMs for Right-to-Be-Forgotten Requests

Guiding LLM Decision-Making with Fairness Reward Models

Web-Browsing LLMs Can Access Social Media Profiles and Infer User Demographics

Assessing the Reliability of LLMs Annotations in the Context of Demographic Bias and Model Explanation
