Advances in Safety and Control of Large Language Models

Research on large language models is increasingly focused on safety and control, with new methods for mitigating risks and vulnerabilities. Recent work spans multimodal prompt decoupling attacks, adaptive subspace steering, and backdoor attribution, among other approaches. These advances aim to improve the reliability and trustworthiness of large language models and to support their safe deployment in real-world applications. Noteworthy papers include Multimodal Prompt Decoupling Attack, which proposes a novel attack that bypasses the safety filters of text-to-image models, and Backdoor Attribution, which introduces a framework for elucidating and controlling backdoor mechanisms in language models. SafeSteer and ASGuard likewise contribute efficient defense mechanisms against jailbreak attacks.
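
Several of the listed papers build on activation steering: adding a direction vector to a model's hidden states at inference time to shift its behavior. The sketch below illustrates only this generic mechanism, not any specific paper's method; the model name, layer index, steering strength, and the random stand-in direction are all illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative choices: any causal LM works; "gpt2" keeps the demo small.
MODEL_NAME = "gpt2"
LAYER = 6      # assumed: a mid-depth block; real methods select this empirically
ALPHA = 4.0    # assumed steering strength

model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model.eval()

# Stand-in steering direction. In the literature this is usually derived
# from data, e.g. the mean activation difference between contrasting
# prompt sets; a random unit vector is used here only to show the shape.
hidden_size = model.config.hidden_size
direction = torch.randn(hidden_size)
direction = direction / direction.norm()

def steer(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states
    # of shape (batch, seq_len, hidden_size); shift them along `direction`.
    hidden_states = output[0] + ALPHA * direction.to(output[0].dtype)
    return (hidden_states,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(steer)
try:
    ids = tokenizer("How do I pick a strong password?", return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=40)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook to restore the unsteered model
```

Defenses such as subspace steering operate on the same hook point but project or rescale activations instead of adding a fixed offset.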

Sources

Multimodal Prompt Decoupling Attack on the Safety Filters in Text-to-Image Models

SafeSteer: Adaptive Subspace Steering for Efficient Jailbreak Defense in Vision-Language Models

JaiLIP: Jailbreaking Vision-Language Models via Loss Guided Image Perturbation

Backdoor Attribution: Elucidating and Controlling Backdoor in Language Models

The Rogue Scalpel: Activation Steering Compromises LLM Safety

Jailbreaking on Text-to-Video Models via Scene Splitting Strategy

Enhancing LLM Steering through Sparse Autoencoder-Based Vector Refinement

Toward Preference-aligned Large Language Models via Residual-based Model Steering

TokenSwap: Backdoor Attack on the Compositional Understanding of Large Vision-Language Models

EasySteer: A Unified Framework for High-Performance and Extensible LLM Steering

VISOR++: Universal Visual Inputs based Steering for Large Vision Language Models

ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack

Alignment-Aware Decoding

Reinforcement Learning-Based Prompt Template Stealing for Text-to-Image Models

Drones that Think on their Feet: Sudden Landing Decisions with Embodied AI

CHAI: Command Hijacking against embodied AI

Backdoor Attacks Against Speech Language Models

A Comparative Analysis of Sparse Autoencoder and Activation Difference in Language Model Steering

Towards Interpretable and Inference-Optimal COT Reasoning with Sparse Autoencoder-Guided Generation
