Advances in Safe and Aligned Language Models

The field of large language models (LLMs) is placing growing emphasis on safety and alignment with human values. Recent work focuses on mitigating false refusal behavior, improving model safety, and preserving overall performance. One notable trend is the use of sparse representation steering and introspective reasoning to make model behavior more controllable and interpretable. There is also growing interest in multi-objective optimization approaches that balance conflicting objectives such as helpfulness, truthfulness, and harm avoidance.
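
To make the steering trend concrete, below is a minimal PyTorch sketch of activation steering with a sparsified steering vector, in the spirit of the guardrails paper listed further down. The function name, hook placement, layer index, and the alpha and k values are illustrative assumptions rather than any paper's actual method; in practice the steering vector would be derived from a sparse autoencoder or from contrasting activations on safe versus unsafe prompts.

```python
import torch

def sparse_steering_hook(steering_vector: torch.Tensor, alpha: float = 4.0, k: int = 32):
    """Build a forward hook that adds a k-sparse steering vector to a layer's hidden states.

    The vector, alpha, and k are illustrative; a real system would learn the vector
    (e.g. via a sparse autoencoder) rather than supply it directly.
    """
    # Keep only the k largest-magnitude components so the intervention stays sparse.
    idx = torch.topk(steering_vector.abs(), k).indices
    sparse_vec = torch.zeros_like(steering_vector)
    sparse_vec[idx] = steering_vector[idx]

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output  # (batch, seq, hidden)
        steered = hidden + alpha * sparse_vec.to(hidden.device, hidden.dtype)
        return (steered,) + output[1:] if isinstance(output, tuple) else steered

    return hook

# Hypothetical usage with a Hugging Face-style decoder model:
# handle = model.model.layers[15].register_forward_hook(sparse_steering_hook(vec))
# outputs = model.generate(**inputs)
# handle.remove()
```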

Noteworthy papers in this area include:

  • Towards LLM Guardrails via Sparse Representation Steering, which proposes a sparse encoding-based representation engineering method to achieve precise and interpretable steering of model behavior.
  • SRMIR: Shadow Reward Models Based on Introspective Reasoning for LLM Alignment, which constructs a balanced safety Chain of Draft dataset and trains shadow reward models to guide policy optimization.
  • Optimizing Safe and Aligned Language Generation: A Multi-Objective GRPO Approach, which proposes a Group Relative Policy Optimization framework with a multi-label reward regression model to achieve safe and aligned language generation (see the sketch after this list).
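
As a rough illustration of the multi-objective GRPO idea, the sketch below scalarizes several reward signals with trade-off weights and normalizes them within a sampling group to obtain group-relative advantages. The reward values, objective labels, and weights are hypothetical, and the multi-label reward regression model from the paper is not shown.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
    """Scalarize several reward heads and normalize within a sampling group, GRPO-style.

    rewards: (group_size, num_objectives) scores for completions of one prompt; the
             objective columns (e.g. helpfulness, safety, truthfulness) are illustrative.
    weights: (num_objectives,) trade-off weights.
    """
    scalar = rewards @ weights                                 # weighted combination of objectives
    return (scalar - scalar.mean()) / (scalar.std() + 1e-8)    # group-relative baseline

# Example: 4 sampled completions scored on 3 objectives.
rewards = torch.tensor([[0.9, 0.2, 0.7],
                        [0.4, 0.9, 0.6],
                        [0.8, 0.8, 0.8],
                        [0.1, 0.5, 0.3]])
weights = torch.tensor([0.4, 0.4, 0.2])
advantages = group_relative_advantages(rewards, weights)       # one advantage per completion
```

The group-wise normalization is what lets GRPO dispense with a learned value baseline; the weighted scalarization shown here is just one simple way to trade off the competing objectives.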

Sources

Safety Evaluation and Enhancement of DeepSeek Models in Chinese Contexts

Towards LLM Guardrails via Sparse Representation Steering

How Effective Is Constitutional AI in Small LLMs? A Study on DeepSeek-R1 and Its Peers

Safe RLHF-V: Safe Reinforcement Learning from Human Feedback in Multimodal Large Language Models

Think Before Refusal: Triggering Safety Reflection in LLMs to Mitigate False Refusal Behavior

SRMIR: Shadow Reward Models Based on Introspective Reasoning for LLM Alignment

Multi-head Reward Aggregation Guided by Entropy

Optimizing Safe and Aligned Language Generation: A Multi-Objective GRPO Approach
