Uncovering and Mitigating Biases in Large Language Models

The field of large language models (LLMs) is moving toward a deeper understanding of the biases and unfairness present in these models. Recent research has highlighted the importance of systematically analyzing and addressing these biases, particularly in multi-agent systems and in multi-turn conversations. The development of new methods and frameworks for detecting and mitigating bias has been a key focus, with an emphasis on improving model reliability and alignment with human values. Notably, approaches such as differential analysis with inference-time masking of bias heads, as well as counterfactual bias evaluation frameworks, have shown promising results in reducing unfairness and promoting fair model behavior.
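To make the counterfactual evaluation idea concrete, the following Python sketch shows one minimal way such a check could be set up: the same open-ended question is asked twice, differing only in a demographic attribute, and the two answers are returned for comparison. This is an illustration only, not the framework from the cited papers; the `TEMPLATE` string, the `counterfactual_gap` helper, and the `ask_model` callable are all hypothetical placeholders.

```python
# Illustrative counterfactual bias check (a sketch, not the cited papers' method):
# ask the same question with only one demographic attribute swapped and compare answers.

from typing import Callable

# Hypothetical question template; the attribute slot is the only thing that changes.
TEMPLATE = "My neighbor, a {attribute} engineer, asked for a raise. Should the manager grant it?"

def counterfactual_gap(ask_model: Callable[[str], str],
                       attribute_a: str,
                       attribute_b: str) -> tuple[str, str]:
    """Return the model's answers to two prompts that differ only in one attribute."""
    answer_a = ask_model(TEMPLATE.format(attribute=attribute_a))
    answer_b = ask_model(TEMPLATE.format(attribute=attribute_b))
    return answer_a, answer_b

if __name__ == "__main__":
    # `ask_model` stands in for any LLM call (e.g. an API client wrapper).
    fake_model = lambda prompt: f"[model answer to: {prompt}]"
    a, b = counterfactual_gap(fake_model, "young", "elderly")
    print(a)
    print(b)
    # A downstream judge (human or model-based) would then score whether the two
    # answers differ in substance, which would signal attribute-conditioned bias.
```

In practice, a framework along these lines would generate many such question pairs automatically and aggregate the pairwise differences into a bias score.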

Some noteworthy papers in this area include:

CoBia presents a suite of lightweight adversarial attacks that construct conversations designed to surface societal biases LLMs otherwise conceal, sharpening our understanding of the conditions under which models depart from normative or ethical behavior.

DiffHeads proposes a lightweight debiasing framework that identifies bias-carrying attention heads via differential analysis and masks them at inference time, reducing unfairness by 49.4% under Direct-Answer prompting and 40.3% under Chain-of-Thought prompting (a minimal sketch of inference-time head masking follows this list).

Analysing Moral Bias in Finetuned LLMs demonstrates that social biases in LLMs can be interpreted, localized, and mitigated through targeted mechanistic interventions, without retraining the model.

Adaptive Generation of Bias-Eliciting Questions introduces a counterfactual bias evaluation framework that automatically generates realistic, open-ended questions to systematically probe where models are most susceptible to biased behavior.
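The sketch below illustrates the general idea of masking attention heads at inference time; it is not the DiffHeads implementation. It assumes a Hugging Face GPT-2 model whose attention modules return the per-block attention output as the first element of a tuple, and the `HEADS_TO_MASK` layer/head indices are arbitrary placeholders standing in for heads that some attribution analysis would have flagged.

```python
# Minimal sketch (not the DiffHeads method) of silencing selected attention heads
# at inference time via forward hooks on a Hugging Face GPT-2 model.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Hypothetical {layer: [head indices]} that an attribution analysis might flag.
HEADS_TO_MASK = {5: [2, 7]}
head_dim = model.config.n_embd // model.config.n_head

def make_hook(head_indices):
    def hook(module, inputs, output):
        attn_out = output[0].clone()  # attention output: (batch, seq, n_embd)
        for h in head_indices:
            # Zero this head's slice so its contribution is dropped from the residual stream.
            attn_out[..., h * head_dim:(h + 1) * head_dim] = 0.0
        return (attn_out,) + output[1:]
    return hook

handles = [model.transformer.h[layer].attn.register_forward_hook(make_hook(heads))
           for layer, heads in HEADS_TO_MASK.items()]

prompt = "The nurse said that"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(generated[0]))

for h in handles:
    h.remove()  # restore the unmodified model
```

Because the masking is applied only through hooks, the intervention requires no retraining and can be toggled on or off per request, which is what makes inference-time approaches of this kind attractive for debiasing.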

Sources

CoBia: Constructed Conversations Can Trigger Otherwise Concealed Societal Biases in LLMs

DiffHeads: Differential Analysis and Inference-Time Masking of Bias Heads in Large Language Models

The Social Cost of Intelligence: Emergence, Propagation, and Amplification of Stereotypical Bias in Multi-Agent Systems

Analysing Moral Bias in Finetuned LLMs through Mechanistic Interpretability

Adaptive Generation of Bias-Eliciting Questions for LLMs

Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences
