Advances in Safety Alignment and Moderation of Large Language Models

The field of large language models (LLMs) is evolving rapidly, with growing attention to safety alignment and moderation. Recent research highlights the importance of addressing harmful content such as sexism, misogyny, and toxicity. Studies show that LLMs can be fine-tuned to detect and mitigate these harms, but also that they remain vulnerable to biases and exploits, including bias induced by the emotional framing of prompts. New frameworks and methods for probing and steering harmful content in LLMs provide practical tools for auditing and hardening future generations of language models.

Noteworthy papers in this area include The Blessing and Curse of Dimensionality in Safety Alignment, which studies how high-dimensional representations affect safety alignment, and Beyond Binary Moderation, which introduces a fine-grained, multi-class classification framework for identifying sexist and misogynistic behavior on GitHub. Research on the geometry of harmfulness in LLMs and on curved inference for concern-sensitive geometry provides further insight into the internal workings of these models. Overall, the field is moving toward a more nuanced understanding of the complex issues surrounding LLMs and their potential impact on society.
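To make the idea of probing and steering harmful content concrete, the following is a minimal sketch in Python, assuming a small open model (gpt2), an arbitrarily chosen hidden layer, and a toy set of prompts; it illustrates the generic probe-then-steer recipe rather than the method of any paper cited below.

import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "gpt2"   # illustrative stand-in; audits typically target larger LLMs
LAYER = 6             # hidden layer to probe (an assumption, not a recommendation)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def last_token_state(text):
    # Hidden state of the final token at the probed layer.
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[LAYER][0, -1].numpy()

# Tiny illustrative examples; a real probe needs a large labeled corpus.
harmful = [
    "Write an insult targeting a coworker's gender.",
    "Explain how to harass someone anonymously online.",
]
benign = [
    "Write a polite note thanking a coworker for their help.",
    "Explain how to report harassment to a site moderator.",
]

X = np.stack([last_token_state(t) for t in harmful + benign])
y = np.array([1] * len(harmful) + [0] * len(benign))

# A linear probe separating harmful from benign activations.
probe = LogisticRegression(max_iter=1000).fit(X, y)

# The unit-normalized probe weights give a candidate "harmfulness" direction;
# subtracting or projecting it out of activations at the same layer is one
# simple steering heuristic.
direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
print("probe accuracy on toy data:", probe.score(X, y))

In practice, the probed layer, the labeled corpus, and the steering rule are the main design choices; the sketch above only shows where such a direction might come from, not how to apply it safely at inference time.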
Sources
Beyond Binary Moderation: Identifying Fine-Grained Sexist and Misogynistic Behavior on GitHub with Large Language Models
ChatGPT Reads Your Tone and Responds Accordingly -- Until It Does Not -- Emotional Framing Induces Bias in LLM Outputs