Advances in Safety and Security of Large Language Models

The field of large language models is rapidly evolving, with a growing focus on improving safety and security. Recent research has highlighted the vulnerabilities of these models to adversarial attacks, data breaches, and malicious fine-tuning. In response, researchers have developed innovative methods to enhance safety, including reinforcement learning approaches, safety-aware probing optimization, and backdoor detection frameworks. Noteworthy papers in this area include AutoRAN, which demonstrates the effectiveness of weak-to-strong jailbreaking attacks, and CTRAP, which proposes a novel paradigm for inducing model collapse to prevent harmful fine-tuning. Overall, the field is moving towards developing more robust and secure large language models that can mitigate potential risks and ensure safe deployment in real-world applications.

Sources

AutoRAN: Weak-to-Strong Jailbreaking of Large Reasoning Models

GenoArmory: A Unified Evaluation Framework for Adversarial Attacks on Genomic Foundation Models

Noise Injection Systemically Degrades Large Language Model Safety Guardrails

Language Models Are Capable of Metacognitive Monitoring and Control of Their Internal Activations

Safety Subspaces are Not Distinct: A Fine-Tuning Case Study

Context Reasoner: Incentivizing Reasoning Capability for Contextualized Privacy and Safety Compliance via Reinforcement Learning

SifterNet: A Generalized and Model-Agnostic Trigger Purification Approach

sudoLLM : On Multi-role Alignment of Language Models

Linear Control of Test Awareness Reveals Differential Compliance in Reasoning Models

SAFEPATH: Preventing Harmful Reasoning in Chain-of-Thought via Early Alignment

HAVA: Hybrid Approach to Value-Alignment through Reward Weighing for Reinforcement Learning

How Should We Enhance the Safety of Large Reasoning Models: An Empirical Study

Be Careful When Fine-tuning On Open-Source LLMs: Your Fine-tuning Data Could Be Secretly Stolen!

Hierarchical Safety Realignment: Lightweight Restoration of Safety in Pruned Large Vision-Language Models

SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning

Three Minds, One Legend: Jailbreak Large Reasoning Model with Adaptive Stacked Ciphers

CTRAP: Embedding Collapse Trap to Safeguard Large Language Models from Harmful Fine-Tuning

Finetuning-Activated Backdoors in LLMs

Mitigating Fine-tuning Risks in LLMs via Safety-Aware Probing Optimization

Accidental Misalignment: Fine-Tuning Language Models Induces Unexpected Vulnerability

Backdoor Cleaning without External Guidance in MLLM Fine-tuning