Safety and Fairness in Large Language Models

The field of natural language processing is placing greater emphasis on safety and fairness in large language models (LLMs). Researchers are working to mitigate the risks of toxic language, bias, and harmful content in LLMs, with a focus on methods for detecting and removing problematic content. These include semantic augmentation, post-generation correction mechanisms, and safety-pretraining frameworks that build safety into models from the start. Notably, several papers propose new approaches to identifying and mitigating online harm, such as comprehensive taxonomies of abusive language and datasets annotated by domain experts. Overall, the field is moving towards a more nuanced understanding of the risks posed by LLMs and a commitment to building more responsible and ethical AI systems. Noteworthy papers include:

  • A Data-Centric Approach for Safe and Secure Large Language Models against Threatening and Toxic Content, which introduces a post-generation correction mechanism that adjusts generated content for safety and security (a minimal sketch of this idea follows the list).
  • Safety Pretraining: Toward the Next Generation of Safe AI, which presents a data-centric pretraining framework that builds safety into the model from the start.
  • MetaHarm: Harmful YouTube Video Dataset Annotated by Domain Experts, GPT-4-Turbo, and Crowdworkers, which provides a large-scale, multi-modal, multi-categorical dataset of online harm to support future work on harm detection and mitigation.
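
To illustrate what a post-generation correction mechanism looks like in practice, the sketch below wraps a generator with a safety check and a rewrite step. This is an illustrative outline only, not the method of the paper above; generate, toxicity_score, and rewrite are hypothetical callables standing in for an LLM, a safety classifier, and a detoxifying rewriter.

```python
from typing import Callable

def safe_generate(
    prompt: str,
    generate: Callable[[str], str],
    toxicity_score: Callable[[str], float],
    rewrite: Callable[[str], str],
    threshold: float = 0.5,
    max_attempts: int = 3,
) -> str:
    """Generate a response, then correct it post hoc until it passes a safety check."""
    text = generate(prompt)
    for _ in range(max_attempts):
        if toxicity_score(text) < threshold:
            return text          # output already passes the safety check
        text = rewrite(text)     # ask a corrector model to rewrite the flagged text
    # give up after max_attempts rather than emit unsafe content
    return "[content withheld: could not be corrected within the safety budget]"

# Toy usage with stub callables; in practice these would be an LLM,
# a toxicity classifier, and a detoxifying rewriter.
print(safe_generate(
    "Summarise the discussion politely.",
    generate=lambda p: "a draft model response",
    toxicity_score=lambda t: 0.1,
    rewrite=lambda t: t,
))
```

A loop of this shape (generate, score, correct) is the common pattern behind detect-and-rewrite moderation pipelines; the papers below differ in where the safety signal comes from and at which stage of training or deployment it is applied.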

Sources

A Baseline for Self-state Identification and Classification in Mental Health Data: CLPsych 2025 Task

Bias Analysis and Mitigation through Protected Attribute Detection and Regard Classification

Neglected Risks: The Disturbing Reality of Children's Images in Datasets and the Urgent Call for Accountability

Combating Toxic Language: A Review of LLM-Based Strategies for Software Engineering

LLM-based Semantic Augmentation for Harmful Content Detection

A Comprehensive Survey in LLM(-Agent) Full Stack Safety: Data, Training and Deployment

A Data-Centric Approach for Safe and Secure Large Language Models against Threatening and Toxic Content

MetaHarm: Harmful YouTube Video Dataset Annotated by Domain Experts, GPT-4-Turbo, and Crowdworkers

Safety Pretraining: Toward the Next Generation of Safe AI

Steering the CensorShip: Uncovering Representation Vectors for LLM "Thought" Control

Towards a comprehensive taxonomy of online abusive language informed by machine learning
