Advances in LLM Safety and Reliability

The field of Large Language Models (LLMs) is moving towards addressing the critical issue of over-refusal, where models erroneously reject benign queries due to overly conservative safety measures. Researchers are exploring innovative methods to detect and analyze over-refusals, such as evolutionary testing frameworks and task-specific trajectory-shifting approaches. Additionally, there is a growing focus on fine-tuning LLMs to improve their safety and reliability, including the development of optimization techniques and safety-aware adaptation methods. Noteworthy papers in this area include ORFuzz, which introduces a novel testing framework for detecting over-refusals, and SafeConstellations, which proposes a trajectory-shifting approach to reduce over-refusals. Other notable works include ToxiFrench, which benchmarks and enhances language models for French toxicity detection, and Rethinking Safety in LLM Fine-tuning, which challenges the belief that fine-tuning inevitably harms safety. Furthermore, researchers are also exploring methods to mitigate unintended misalignment in agentic fine-tuning, such as the Prefix INjection Guard (PING) method. Overall, these advances are paving the way for the development of more reliable and trustworthy LLM-based software systems.

Sources

ORFuzz: Fuzzing the "Other Side" of LLM Safety -- Testing Over-Refusal

ToxiFrench: Benchmarking and Enhancing Language Models via CoT Fine-Tuning for French Toxicity Detection

SafeConstellations: Steering LLM Safety to Reduce Over-Refusals Through Task-Specific Trajectory

Rethinking Safety in LLM Fine-tuning: An Optimization Perspective

Unintended Misalignment from Agentic Fine-Tuning: Risks and Mitigation

Efficient Switchable Safety Control in LLMs via Magic-Token-Guided Co-Training

S3LoRA: Safe Spectral Sharpness-Guided Pruning in Adaptation of Agent Planner

Built with on top of