Debiasing Large Language Models

The field of large language models is increasingly focused on detecting and mitigating the implicit biases and stereotypes that surface in model outputs. Proposed approaches include interpretable bias detection, topological data analysis, and activation steering, all aimed at producing fairer and more accurate representations of different demographic groups. Noteworthy papers include 'Semantic and Structural Analysis of Implicit Biases in Large Language Models', which proposes an interpretable bias detection method; 'Activation Steering for Bias Mitigation', which introduces a complete system for identifying and mitigating bias directly within a model's internal workings; and 'MSRS: Adaptive Multi-Subspace Representation Steering for Attribute Alignment in Large Language Models', which presents a framework for multi-attribute steering via subspace representation fine-tuning. Together, these studies underscore the importance of addressing bias in large language models and offer concrete methods for doing so.
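To make the activation-steering idea concrete, below is a minimal, illustrative sketch of nudging a model's hidden states away from an estimated bias direction. It is not the method of any cited paper: the backbone ('gpt2'), the steered layer index, the scaling factor ALPHA, and the contrastive prompt pairs are all hypothetical choices, and the bias direction is simply the difference of mean hidden states between stereotyped and neutral prompts.

```python
# Minimal sketch of activation steering for bias mitigation, assuming a
# PyTorch/Hugging Face causal LM. The backbone, layer index, scale, and
# prompts below are illustrative assumptions, not taken from the cited papers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # hypothetical backbone
LAYER_IDX = 6         # hypothetical layer whose output is steered
ALPHA = -4.0          # negative scale pushes activations away from the bias direction

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def mean_hidden(prompts, layer_idx):
    """Average hidden state produced by transformer block `layer_idx` over a set of prompts."""
    states = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        # hidden_states[i + 1] is the output of transformer block i.
        states.append(out.hidden_states[layer_idx + 1][0].mean(dim=0))
    return torch.stack(states).mean(dim=0)

# Contrastive prompts that differ only in how the demographic attribute is expressed.
stereotyped = ["The nurse said that she", "The engineer said that he"]
neutral     = ["The nurse said that they", "The engineer said that they"]

# Estimate a "bias direction" as the difference of mean activations.
steer_vec = mean_hidden(stereotyped, LAYER_IDX) - mean_hidden(neutral, LAYER_IDX)

def steering_hook(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden state;
    # shift it along the (negated) bias direction and keep the rest untouched.
    hidden = output[0] + ALPHA * steer_vec.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER_IDX].register_forward_hook(steering_hook)

ids = tok("The doctor told the patient that", return_tensors="pt")
with torch.no_grad():
    gen = model.generate(**ids, max_new_tokens=20, do_sample=False)
print(tok.decode(gen[0], skip_special_tokens=True))

handle.remove()  # restore the unsteered model
```

A real system would estimate the direction from a much larger probe set, sweep the layer and scale, and verify that general capabilities are preserved; subspace-based approaches such as MSRS additionally learn separate representation subspaces per attribute so that several attributes can be steered at once.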
Sources
Semantic and Structural Analysis of Implicit Biases in Large Language Models: An Interpretable Approach
Investigating Intersectional Bias in Large Language Models using Confidence Disparities in Coreference Resolution
"Draw me a curator" Examining the visual stereotyping of a cultural services profession by generative AI