Advances in AI Safety and Alignment

The field of AI safety and alignment is evolving rapidly, with growing emphasis on methods that ensure large language models (LLMs) and vision-language models (VLMs) behave safely and remain aligned with human values. Recent work explores influence functions for pruning harmful examples from preference datasets, simplified reinforcement learning frameworks that incentivize safety alignment, and cost-constrained runtime monitors that detect and block misaligned outputs. Other lines of research steer out-of-distribution generalization through concept ablation fine-tuning and improve VLM safety via intent awareness.
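To make the influence-function pruning idea concrete, here is a minimal first-order sketch (closer to a TracIn-style gradient-dot-product approximation than to full inverse-Hessian influence, and not the cited paper's implementation): training examples whose gradients align with the gradient of a probe set of harmful behaviors are the ones most likely to push the model toward those behaviors, so they are pruned. The names `loss_fn`, `prune_by_influence`, and `keep_frac` are illustrative assumptions.

```python
import torch

def grad_vector(model, loss_fn, batch):
    """Flattened gradient of the loss on one batch w.r.t. trainable parameters."""
    params = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn(model, batch)
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

def prune_by_influence(model, loss_fn, train_batches, probe_batches, keep_frac=0.9):
    """First-order influence pruning (sketch).

    A gradient step on example z changes the probe loss by roughly
    -lr * <grad L_probe, grad L(z)>, so a large positive dot product means
    training on z lowers the loss on the harmful probe set, i.e. z is
    harm-promoting. We therefore keep the lowest-scoring fraction.
    """
    probe_grad = sum(grad_vector(model, loss_fn, b) for b in probe_batches)
    scores = [torch.dot(grad_vector(model, loss_fn, b), probe_grad).item()
              for b in train_batches]
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    kept = order[: int(keep_frac * len(order))]
    return [train_batches[i] for i in kept]
```

A full influence-function treatment would additionally apply an inverse-Hessian-vector product to the probe gradient; the dot-product form above trades that precision for a single backward pass per example.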

Noteworthy papers include AlphaAlign, which proposes a simple yet effective pure reinforcement learning framework with a verifiable safety reward to incentivize latent safety awareness; GrAInS, which introduces gradient-based attribution for inference-time steering of LLMs and VLMs; and SafeWork-R1, which demonstrates the coevolution of capability and safety in a multimodal reasoning model. LoRA is All You Need for Safety Alignment of Reasoning LLMs shows that low-rank adaptation achieves high safety levels without compromising reasoning ability, while Layer-Aware Representation Filtering identifies safety-sensitive layers within an LLM and uses their representations to detect which samples in a post-training dataset carry safety-degrading features.
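As a rough illustration of representation-based data filtering (a sketch under assumptions, not the Layer-Aware Representation Filtering paper's exact procedure), one can score each finetuning sample by how strongly its hidden state at a chosen layer projects onto an unsafe-minus-safe mean-difference direction, then drop the highest-scoring samples. The layer index, the mean-difference direction, and the function names here are hypothetical.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def layer_rep(model, tok, texts, layer):
    """Mean-pooled hidden state at one layer for a list of texts."""
    # causal-LM tokenizers often need: tok.pad_token = tok.eos_token
    enc = tok(texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**enc, output_hidden_states=True)
    h = out.hidden_states[layer]                          # (batch, seq, dim)
    mask = enc["attention_mask"].unsqueeze(-1).float()    # ignore padding
    return (h * mask).sum(1) / mask.sum(1)

def filter_finetuning_data(model, tok, candidates, safe_refs, unsafe_refs,
                           layer=12, keep_frac=0.9):
    """Drop candidates whose representation at a (hypothesized) safety-sensitive
    layer points along the unsafe-minus-safe direction."""
    direction = (layer_rep(model, tok, unsafe_refs, layer).mean(0)
                 - layer_rep(model, tok, safe_refs, layer).mean(0))
    direction = direction / direction.norm()
    scores = layer_rep(model, tok, candidates, layer) @ direction
    kept = scores.argsort()[: int(keep_frac * len(candidates))].tolist()
    return [candidates[i] for i in kept]                  # low projection = kept
```

The key design choice this sketch highlights is that filtering happens at a specific internal layer rather than on surface text, so samples that look benign but shift safety-relevant representations can still be caught.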

Sources

Influence Functions for Preference Dataset Pruning

AlphaAlign: Incentivizing Safety Alignment with Extremely Simplified Reinforcement Learning

Combining Cost-Constrained Runtime Monitors for AI Safety

Frontier AI Risk Management Framework in Practice: A Risk Analysis Technical Report

Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning

SIA: Enhancing Safety via Intent Awareness for Vision-Language Models

LoRA is All You Need for Safety Alignment of Reasoning LLMs

GrAInS: Gradient-based Attribution for Inference-Time Steering of LLMs and VLMs

SafeWork-R1: Coevolving Safety and Intelligence under the AI-45° Law

Layer-Aware Representation Filtering: Purifying Finetuning Data to Preserve LLM Safety Alignment
