Debiasing Large Language Models

The field of large language models is increasingly focused on detecting and mitigating the implicit biases and stereotypes that surface in model outputs. Proposed approaches include interpretable bias detection, topological data analysis, and activation steering, all aimed at producing fairer and more accurate representations of different demographic groups. Noteworthy papers include 'Semantic and Structural Analysis of Implicit Biases in Large Language Models', which proposes an interpretable bias detection method; 'Activation Steering for Bias Mitigation', which introduces a complete system for identifying and mitigating bias directly within a model's internal workings; and 'MSRS: Adaptive Multi-Subspace Representation Steering for Attribute Alignment in Large Language Models', which presents a framework for multi-attribute steering via subspace representation fine-tuning. Together, these studies underscore the importance of addressing bias in large language models and offer concrete methods for doing so.
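To make the activation-steering idea concrete, below is a minimal, illustrative sketch of nudging a model's hidden states away from an estimated bias direction. It is not the method of any cited paper: the backbone ('gpt2'), the steered layer index, the scaling factor ALPHA, and the contrastive prompt pairs are all hypothetical choices, and the bias direction is simply the difference of mean hidden states between stereotyped and neutral prompts.

```python
# Minimal sketch of activation steering for bias mitigation, assuming a
# PyTorch/Hugging Face causal LM. The backbone, layer index, scale, and
# prompts below are illustrative assumptions, not taken from the cited papers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # hypothetical backbone
LAYER_IDX = 6         # hypothetical layer whose output is steered
ALPHA = -4.0          # negative scale pushes activations away from the bias direction

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def mean_hidden(prompts, layer_idx):
    """Average hidden state produced by transformer block `layer_idx` over a set of prompts."""
    states = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        # hidden_states[i + 1] is the output of transformer block i.
        states.append(out.hidden_states[layer_idx + 1][0].mean(dim=0))
    return torch.stack(states).mean(dim=0)

# Contrastive prompts that differ only in how the demographic attribute is expressed.
stereotyped = ["The nurse said that she", "The engineer said that he"]
neutral     = ["The nurse said that they", "The engineer said that they"]

# Estimate a "bias direction" as the difference of mean activations.
steer_vec = mean_hidden(stereotyped, LAYER_IDX) - mean_hidden(neutral, LAYER_IDX)

def steering_hook(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden state;
    # shift it along the (negated) bias direction and keep the rest untouched.
    hidden = output[0] + ALPHA * steer_vec.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER_IDX].register_forward_hook(steering_hook)

ids = tok("The doctor told the patient that", return_tensors="pt")
with torch.no_grad():
    gen = model.generate(**ids, max_new_tokens=20, do_sample=False)
print(tok.decode(gen[0], skip_special_tokens=True))

handle.remove()  # restore the unsteered model
```

A real system would estimate the direction from a much larger probe set, sweep the layer and scale, and verify that general capabilities are preserved; subspace-based approaches such as MSRS additionally learn separate representation subspaces per attribute so that several attributes can be steered at once.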
Sources
Semantic and Structural Analysis of Implicit Biases in Large Language Models: An Interpretable Approach
Investigating Intersectional Bias in Large Language Models using Confidence Disparities in Coreference Resolution
"Draw me a curator" Examining the visual stereotyping of a cultural services profession by generative AI