Safety and Alignment in Large Language Models

Research on Large Language Models (LLMs) is increasingly focused on safety and alignment. Recent work highlights the risks these models pose, including their vulnerability to jailbreak attacks and their potential to perpetuate harmful biases. One key direction is the development of more sophisticated safety evaluation methods that account for instruction complexity and for the reasoning capabilities of LLMs. Another is the exploration of new alignment paradigms such as Constructive Safety Alignment, which guides users toward safe and helpful outcomes rather than simply refusing to engage with harmful content.

Notable papers include Thinking Hard, Going Misaligned, which studies emergent misalignment induced by reasoning in LLMs; Oyster-I, which introduces a human-centric, constructive approach to safety alignment; Strata-Sword, which proposes a hierarchical safety evaluation benchmark based on the reasoning complexity of jailbreak instructions; and Unraveling LLM Jailbreaks Through Safety Knowledge Neurons, which presents a neuron-level interpretability method for understanding and defending against jailbreak attacks (a generic illustration of the neuron-level idea is sketched below).
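To make the neuron-level framing concrete, the minimal sketch below illustrates one generic way such an analysis could be set up: score individual neurons by how differently they activate on harmful versus benign prompts, then flag the highest-scoring ones as candidate "safety knowledge" neurons. This is only an assumption-laden illustration, not the method of the cited paper; the activation matrices here are synthetic, whereas a real analysis would extract activations from an actual LLM.

```python
import numpy as np

# Hypothetical, synthetic illustration of neuron-level safety analysis.
# Real work would collect hidden activations from an LLM on labeled
# harmful/benign prompts; here we fabricate activations for clarity.
rng = np.random.default_rng(0)
n_prompts, n_neurons = 200, 512

# Activation matrices of shape (num_prompts, num_neurons).
benign_acts = rng.normal(0.0, 1.0, size=(n_prompts, n_neurons))
harmful_acts = rng.normal(0.0, 1.0, size=(n_prompts, n_neurons))
harmful_acts[:, :10] += 2.0  # pretend the first 10 neurons respond to harmful content

# Score each neuron by the absolute difference in mean activation
# between harmful and benign prompts, then rank the neurons.
scores = np.abs(harmful_acts.mean(axis=0) - benign_acts.mean(axis=0))
top_neurons = np.argsort(scores)[::-1][:10]
print("Candidate safety-knowledge neurons:", sorted(top_neurons.tolist()))
```

Under these assumptions, a defense could then monitor or intervene on the flagged neurons at inference time; the cited paper should be consulted for how such neurons are actually identified and used.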

Sources

Thinking Hard, Going Misaligned: Emergent Misalignment in LLMs

Confident, Calibrated, or Complicit: Probing the Trade-offs between Safety Alignment and Ideological Bias in Language Models in Detecting Hate Speech

Strata-Sword: A Hierarchical Safety Evaluation towards LLMs based on Reasoning Complexity of Jailbreak Instructions

Unraveling LLM Jailbreaks Through Safety Knowledge Neurons

Oyster-I: Beyond Refusal -- Constructive Safety Alignment for Responsible Language Models

BioBlue: Notable runaway-optimiser-like LLM failure modes on biologically and economically aligned AI safety benchmarks for LLMs with simplified observation format

Between a Rock and a Hard Place: Exploiting Ethical Reasoning to Jailbreak LLMs

Reasoning Introduces New Poisoning Attacks Yet Makes Them More Complicated
