Advancements in Large Language Model Safety and Alignment

The field of large language models (LLMs) is moving towards improving safety and alignment, with a focus on developing innovative methods to mitigate potential risks and vulnerabilities. Recent research has explored the use of sparse autoencoders, parameter-efficient fine-tuning, and visual prompt frameworks to enhance safety and reduce over-refusal. Additionally, studies have investigated the resilience of LLMs to prompt injection attacks and proposed frameworks for evaluating their robustness. Notable papers in this area include: Feature-Guided SAE Steering for Refusal-Rate Control using Contrasting Prompts, which achieved an 18.9% improvement in safety performance and an 11.1% increase in utility. DRIP: Defending Prompt Injection via De-instruction Training and Residual Fusion Model Architecture, which outperformed state-of-the-art defenses and improved role separation by 49%. Reimagining Safety Alignment with An Image, which proposed a visual prompt framework that enhances security while reducing over-refusal. Efficiency vs. Alignment: Investigating Safety and Fairness Risks in Parameter-Efficient Fine-Tuning of LLMs, which found that adapter-based approaches tend to improve safety scores and are the least disruptive to fairness.

Advancements in Large Language Model Safety and Alignment

Sources