Advances in Large Language Model Safety

The field of large language models (LLMs) is evolving rapidly, with a growing focus on safety and reliability. Recent work centers on known vulnerabilities of LLMs, including their tendency to generate harmful content and their susceptibility to jailbreak attacks. To mitigate these risks, researchers are exploring approaches such as reachability analysis, multi-objective alignment, and inverse reasoning. These methods aim for earlier and more accurate detection of unsafe continuations, along with more effective steering mechanisms that redirect generation away from unsafe regions. Noteworthy papers include Preemptive Detection and Steering of LLM Misalignment via Latent Reachability, which proposes a reachability-based framework for inference-time LLM safety, and InvThink, which applies inverse reasoning to build safer language models. Overall, the field is moving toward more robust and controllable LLMs, with an emphasis on dynamic, modular, and inference-aware control.
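To make the "detect, then steer" idea concrete, the sketch below shows the general inference-time pattern shared by several of these methods: score a hidden state with a safety probe and, if it looks unsafe, add a steering vector before generation continues. This is a minimal illustration only, not the algorithm of any paper listed here; the probe weights (`probe_w`, `probe_b`), steering direction (`safe_direction`), and the `THRESHOLD` and `ALPHA` hyperparameters are placeholder assumptions standing in for quantities a real method would learn.

```python
# Illustrative sketch: linear safety probe on a hidden state plus a steering
# correction applied when the probe flags a likely-unsafe continuation.
# All learned quantities below are random placeholders, not trained artifacts.
import torch

torch.manual_seed(0)

hidden_dim = 768
h = torch.randn(hidden_dim)  # hidden state of the current token (placeholder)

# Hypothetical learned artifacts: a linear probe scoring "unsafety" and a
# unit-norm direction along which generation is pushed back toward safety.
probe_w = torch.randn(hidden_dim)
probe_b = torch.tensor(0.0)
safe_direction = torch.randn(hidden_dim)
safe_direction = safe_direction / safe_direction.norm()

THRESHOLD = 0.5  # flag the continuation if P(unsafe) exceeds this (assumed)
ALPHA = 4.0      # steering strength (assumed hyperparameter)

def monitor_and_steer(hidden: torch.Tensor) -> torch.Tensor:
    """Score the hidden state with the probe; if it looks unsafe, add the
    steering vector before the state reaches the next layer / LM head."""
    p_unsafe = torch.sigmoid(hidden @ probe_w + probe_b)
    if p_unsafe > THRESHOLD:
        hidden = hidden + ALPHA * safe_direction
    return hidden

steered = monitor_and_steer(h)
print("state was steered:", not torch.equal(steered, h))
```

In practice such a monitor would run on intermediate activations at every decoding step (for example via a forward hook), which is what makes these approaches inference-aware rather than purely training-time interventions.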

Sources

Preemptive Detection and Steering of LLM Misalignment via Latent Reachability

We Think, Therefore We Align LLMs to Helpful, Harmless and Honest Before They Go Wrong

DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models

Sanitize Your Responses: Mitigating Privacy Leakage in Large Language Models

SafeBehavior: Simulating Human-Like Multistage Reasoning to Mitigate Jailbreak Attacks in Large Language Models

Toward Safer Diffusion Language Models: Discovery and Mitigation of Priming Vulnerability

Safety Instincts: LLMs Learn to Trust Their Internal Compass for Self-Defense

Detoxifying Large Language Models via Autoregressive Reward Guided Representation Editing

Think Twice, Generate Once: Safeguarding by Progressive Self-Reflection

InvThink: Towards AI Safety via Inverse Reasoning

Towards Speeding up Program Repair with Non-Autoregressive Model

Inverse Language Modeling towards Robust and Grounded LLMs

UpSafe°C: Upcycling for Controllable Safety in Large Language Models
