Advancements in Large Language Model Safety and Alignment

Research on large language models is placing growing emphasis on safety and alignment. Recent work identifies and mitigates risks introduced by fine-tuning pre-trained models, such as reliance on spurious tokens and the loss of essential capabilities like ignorance awareness (knowing when not to answer). New methods, including the Alignment Quality Index (AQI) and Low-Rank Extrapolation (LoX), offer empirical ways to assess and strengthen alignment, with the shared goal of making fine-tuned models safer and more reliable. Noteworthy papers include:

  • LoRA Users Beware: A Few Spurious Tokens Can Manipulate Your Finetuned Model, which highlights the vulnerability of parameter-efficient fine-tuning methods to spurious tokens.
  • Model Organisms for Emergent Misalignment, which introduces improved model organisms to study emergent misalignment in large language models.
  • Alignment Quality Index (AQI), which proposes a novel geometric, prompt-invariant metric for assessing LLM alignment; a toy sketch of the underlying idea appears after this list.
  • Don't Make It Up: Preserving Ignorance Awareness in LLM Fine-Tuning, which presents a fine-tuning approach that preserves ignorance awareness in LLMs.
  • LoX: Low-Rank Extrapolation Robustifies LLM Safety Against Fine-tuning, which demonstrates that Low-Rank Extrapolation strengthens safety robustness against fine-tuning attacks; see the second sketch after this list.
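
To make the AQI-style diagnostic concrete, the sketch below scores how cleanly pooled activations for safe and unsafe prompts separate in latent space, which is the kind of quantity the paper's cluster-divergence indices capture. This is a minimal illustration only: the silhouette score stands in for the paper's actual indices, and the pooled activations are synthetic rather than taken from a real model.

```python
import numpy as np
from sklearn.metrics import silhouette_score

def alignment_separation_score(hidden_states: np.ndarray,
                               labels: np.ndarray) -> float:
    """Score how cleanly safe vs. unsafe prompt activations separate.

    hidden_states: (n_prompts, d) pooled layer-wise activations.
    labels: 0 for safe prompts, 1 for unsafe prompts.
    Higher values mean the two behaviours occupy more distinct regions
    of latent space, which an AQI-style metric reads as intact alignment.
    """
    return float(silhouette_score(hidden_states, labels))

# Toy usage with synthetic activations standing in for pooled hidden states.
rng = np.random.default_rng(0)
safe = rng.normal(loc=0.0, scale=1.0, size=(100, 32))
unsafe = rng.normal(loc=3.0, scale=1.0, size=(100, 32))
states = np.vstack([safe, unsafe])
labels = np.array([0] * 100 + [1] * 100)
print(alignment_separation_score(states, labels))
```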
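
The second sketch illustrates the general idea behind LoX: treat the difference between aligned and base weights as the safety update, and extrapolate its dominant low-rank component so that subsequent fine-tuning is less likely to erase it. The rank and the scaling factor alpha below are illustrative knobs rather than values from the paper, and a real application would repeat this per weight matrix of the model.

```python
import numpy as np

def low_rank_extrapolate(w_base: np.ndarray,
                         w_aligned: np.ndarray,
                         rank: int = 8,
                         alpha: float = 0.5) -> np.ndarray:
    """Amplify the low-rank component of the alignment weight update.

    delta = w_aligned - w_base is treated as the safety-alignment update;
    its top-`rank` singular directions are extrapolated by a factor alpha.
    """
    delta = w_aligned - w_base
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    # Keep only the top-`rank` singular directions of the update.
    delta_low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank, :]
    # Extrapolate: push the aligned weights further along that subspace.
    return w_aligned + alpha * delta_low_rank

# Toy usage on a single weight matrix (real models apply this per layer).
rng = np.random.default_rng(0)
w_base = rng.standard_normal((64, 64))
w_aligned = w_base + 0.1 * rng.standard_normal((64, 64))
w_robust = low_rank_extrapolate(w_base, w_aligned, rank=4, alpha=0.5)
```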

Sources

LoRA Users Beware: A Few Spurious Tokens Can Manipulate Your Finetuned Model

Model Organisms for Emergent Misalignment

Alignment Quality Index (AQI): Beyond Refusals: AQI as an Intrinsic Alignment Diagnostic via Latent Geometry, Cluster Divergence, and Layer wise Pooled Representations

Don't Make It Up: Preserving Ignorance Awareness in LLM Fine-Tuning

LoX: Low-Rank Extrapolation Robustifies LLM Safety Against Fine-tuning
