The field of natural language processing is placing greater emphasis on safety and fairness in large language models (LLMs). Researchers are working to mitigate the risks of toxic language, bias, and harmful content, developing methods for detecting and removing problematic content, including semantic augmentation, post-generation correction mechanisms, and safety-pretraining frameworks that build safer models from the start. Other work targets online harm more broadly, for example through comprehensive taxonomies of abusive language and datasets annotated by domain experts. Overall, the field is moving towards a more nuanced understanding of the complex issues surrounding LLMs and a stronger commitment to responsible, ethical AI systems. Noteworthy papers include:
- A Data-Centric Approach for Safe and Secure Large Language Models against Threatening and Toxic Content, which introduces a post-generation correction mechanism that adjusts generated content for safety and security (a hedged sketch of this detect-then-rewrite pattern follows the list).
- Safety Pretraining: Toward the Next Generation of Safe AI, which presents a data-centric pretraining framework that builds safety into the model from the start (see the second sketch after the list).
- MetaHarm: Harmful YouTube Video Dataset Annotated by Domain Experts, GPT-4-Turbo, and Crowdworkers, which provides a large-scale, multi-modal, multi-categorical dataset of online harm to support future work on harm detection and mitigation.
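
To make the detect-then-rewrite idea behind post-generation correction concrete, the sketch below scores a model's output with an off-the-shelf toxicity classifier and asks the LLM to rephrase until the score falls below a threshold. This is a minimal illustration under stated assumptions, not the paper's actual mechanism: the `unitary/toxic-bert` detector, the 0.5 threshold, and the caller-supplied `rewrite_fn` are all placeholders chosen for the example.

```python
# Minimal sketch of a detect-then-rewrite post-generation correction loop.
# Assumptions (not from the paper): the unitary/toxic-bert detector, the
# 0.5 threshold, and the caller-supplied rewrite_fn.
from transformers import pipeline

# Off-the-shelf toxicity classifier used purely as an example detector.
toxicity = pipeline("text-classification", model="unitary/toxic-bert")

def correct_output(text: str, rewrite_fn, threshold: float = 0.5, max_rounds: int = 3) -> str:
    """Rewrite `text` via `rewrite_fn` until the top toxicity score drops
    below `threshold`, or give up after `max_rounds` attempts."""
    for _ in range(max_rounds):
        score = toxicity(text, truncation=True, function_to_apply="sigmoid")[0]["score"]
        if score < threshold:
            return text
        # rewrite_fn is any callable that asks the generating LLM to rephrase
        # its answer safely; prompt and decoding settings are left to the caller.
        text = rewrite_fn(text)
    return "[response withheld: could not be corrected within the round limit]"

# Example usage with a trivial placeholder rewriter (a real system would call
# the LLM with a "rephrase this safely" instruction instead).
if __name__ == "__main__":
    print(correct_output("The weather is lovely today.", rewrite_fn=lambda t: t))
```

In practice the rewriter would be a prompted call back to the same LLM (e.g. "rewrite the answer so it is safe while preserving its meaning"); the loop structure is the point of the sketch, not the specific detector or prompt.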
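
Safety pretraining is data-centric: problematic documents are identified and handled before the model ever sees them. The sketch below shows only the simplest form of that idea, filtering a corpus with a safety classifier prior to training. The `unitary/toxic-bert` scorer, the 0.2 threshold, and `ag_news` standing in for a pretraining corpus are assumptions for illustration; the paper's framework is richer than plain filtering.

```python
# Illustrative sketch of data-centric safety filtering ahead of pretraining.
# Assumptions (not the paper's actual framework): the unitary/toxic-bert
# scorer, the 0.2 threshold, and ag_news standing in for a pretraining corpus.
from datasets import load_dataset
from transformers import pipeline

safety_scorer = pipeline("text-classification", model="unitary/toxic-bert")

def is_safe(example: dict, threshold: float = 0.2) -> bool:
    # Keep a document only if its highest toxicity-style score stays low.
    result = safety_scorer(example["text"], truncation=True, function_to_apply="sigmoid")[0]
    return result["score"] < threshold

# Filter a small public corpus as a stand-in for real pretraining data.
corpus = load_dataset("ag_news", split="train[:1000]")
safe_corpus = corpus.filter(is_safe)
print(f"kept {len(safe_corpus)} of {len(corpus)} documents for pretraining")
```

Discarding documents is the crudest data-centric intervention; the sketch is meant to show the plumbing of scoring and filtering a corpus, not the policy a full safety-pretraining framework would apply.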