The field of large language models (LLMs) is placing growing emphasis on safety alignment, with a focus on developing methods to evaluate and improve the safety of these models. Researchers are investigating how fine-tuning affects safety, finding that it can compromise safety alignment even when the fine-tuning data are benign. To address this, novel approaches such as pruning-based methods and persona-feature control are being proposed to improve safety while preserving task performance. Noteworthy papers include PL-Guard, which introduces a benchmark dataset for language model safety classification in Polish; Safe Pruning LoRA, which proposes a pruning-based approach to improve safety alignment in LLMs; and Persona Features Control Emergent Misalignment, which investigates the mechanisms behind emergent misalignment in LLMs and proposes mitigation strategies.
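To make the pruning-based idea concrete, the sketch below illustrates one generic way such an approach could work: score the rank-1 components of a LoRA update by how strongly they write into a safety-critical direction, then zero out the most interfering components. This is a minimal illustration under assumed dimensions and a randomly generated "safety direction"; it is not the actual method of Safe Pruning LoRA or any of the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: a LoRA update W_delta = B @ A of rank r.
d_out, d_in, r = 64, 64, 8
A = rng.normal(size=(r, d_in))   # LoRA "down" projection
B = rng.normal(size=(d_out, r))  # LoRA "up" projection

# Assumed stand-in for a safety-critical direction in output space,
# e.g. one estimated from refusal vs. non-refusal activations.
safety_dir = rng.normal(size=d_out)
safety_dir /= np.linalg.norm(safety_dir)

# Score each rank-1 component B[:, i] * A[i] by how strongly it
# writes into the safety-critical direction.
scores = np.array([
    abs(safety_dir @ B[:, i]) * np.linalg.norm(A[i]) for i in range(r)
])

# Prune (zero out) the components with the highest interference
# scores, keeping the rest of the fine-tuned update intact.
k_prune = 2
pruned = np.argsort(scores)[-k_prune:]
B_pruned = B.copy()
B_pruned[:, pruned] = 0.0

W_delta = B_pruned @ A  # pruned LoRA update applied to base weights
print("pruned rank-1 components:", sorted(pruned.tolist()))
```

In practice, the pruning criterion and the way the safety direction is estimated are the key design choices; the sketch only shows the general shape of trading a small amount of the fine-tuned update for restored safety behavior.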