Advances in Safe and Robust Language Models

Research in natural language processing is increasingly focused on making large language models safer and more robust. Recent work targets both security and usability, with particular emphasis on mitigating adversarial attacks and on making model decision-making more transparent and accountable. One key direction is the development of defense frameworks that combine activation-level intervention with policy-level optimization to enhance model robustness. Another is the study of principled interventions on transformer-based models at multiple levels: prompts, activations, and weights.

Noteworthy papers in this area include:

Activation Steering Meets Preference Optimization: Defense Against Jailbreaks in Vision Language Models proposes a two-stage defense framework for enhancing model robustness.

AntiDote: Bi-level Adversarial Training for Tamper-Resistant LLMs introduces a bi-level optimization procedure for training LLMs to be resistant to tampering.

TrinityX is a modular alignment framework that incorporates a Mixture of Calibrated Experts within the Transformer architecture to align LLM outputs with the principles of Helpfulness, Harmlessness, and Honesty.
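To make the idea of an activation-level intervention concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the method of any paper listed here: a hidden state is treated as a simple list of floats, and the helper names (`steer`, `ablate`) and the example direction are hypothetical. `steer` shifts a hidden state along a chosen direction, and `ablate` removes the component of the hidden state along that direction.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Return v scaled to unit length; reject the zero vector."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [a / norm for a in v]


def steer(hidden, direction, alpha):
    """Hypothetical helper: add alpha times the unit direction to a hidden state."""
    unit = normalize(direction)
    return [h + alpha * d for h, d in zip(hidden, unit)]


def ablate(hidden, direction):
    """Hypothetical helper: project out the component of hidden along direction."""
    unit = normalize(direction)
    coeff = dot(hidden, unit)
    return [h - coeff * d for h, d in zip(hidden, unit)]


# Toy 2-dimensional example (values are illustrative, not from any model).
hidden = [3.0, 4.0]
direction = [0.0, 2.0]

steered = steer(hidden, direction, alpha=1.5)
ablated = ablate(hidden, direction)
```

In a real model, an operation of this kind would typically be applied to the activations of selected layers during the forward pass. The choice of direction, layers, and scaling factor is where individual methods differ, and none of those choices are specified in the summary above.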

Sources

Activation Steering Meets Preference Optimization: Defense Against Jailbreaks in Vision Language Models

Abex-rat: Synergizing Abstractive Augmentation and Adversarial Training for Classification of Occupational Accident Reports

Manipulating Transformer-Based Models: Controllability, Steerability, and Robust Interventions

Towards an Automated Framework to Audit Youth Safety on TikTok

Anchoring Refusal Direction: Mitigating Safety Risks in Tuning via Projection Constraint

MoGU V2: Toward a Higher Pareto Frontier Between Model Usability and Security

AntiDote: Bi-level Adversarial Training for Tamper-Resistant LLMs

Too Helpful, Too Harmless, Too Honest or Just Right?
