Advances in Large Language Model Safety and Robustness

Recent work on large language models (LLMs) centers on safety and robustness: preventing harmful outputs through orchestrated prompting frameworks, hardening models with adversarial robustness distillation, probing weaknesses via jailbreak attacks and multi-turn red teaming, and governing prompts collectively in deployed systems. Together, these efforts aim to make LLMs more reliable and trustworthy in real-world applications. Noteworthy papers include PromptGuard, a prompting framework for preventing the generation of harmful information; CIARD, a cyclic iterative adversarial robustness distillation method for improving model robustness; MUSE, an MCTS-driven red-teaming framework for multi-turn dialogue safety; and DeepRefusal, a safety-alignment framework that probabilistically ablates the refusal direction to defend against adversarial attacks.
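
To make "ablating a refusal direction" concrete, here is a minimal sketch, assuming a refusal direction has already been estimated (for example, as a difference of mean activations between refused and complied prompts). The function name, the Bernoulli mask, and the projection form are illustrative assumptions for this digest, not DeepRefusal's published method.

```python
# Minimal sketch: remove a hidden state's component along an assumed
# "refusal direction" d, i.e. h <- h - (h . d_hat) d_hat, applied with
# probability p per example. Illustrative only, not the paper's code.
import torch

def ablate_refusal_direction(h: torch.Tensor,
                             d: torch.Tensor,
                             p: float = 1.0) -> torch.Tensor:
    """Project out the refusal direction d from hidden states h.

    h: (..., hidden_dim) activations at some layer.
    d: (hidden_dim,) assumed refusal direction.
    p: probability of applying the ablation to each example.
    """
    d_hat = d / d.norm()                            # unit-norm direction
    coeff = (h * d_hat).sum(dim=-1, keepdim=True)   # projection coefficient
    ablated = h - coeff * d_hat                     # drop the component along d_hat
    if p >= 1.0:
        return ablated
    # Bernoulli mask: each example independently keeps or loses its refusal component.
    mask = (torch.rand(h.shape[:-1], device=h.device) < p).unsqueeze(-1)
    return torch.where(mask, ablated, h)

# Example: a batch of 4 activations with hidden size 8
h = torch.randn(4, 8)
d = torch.randn(8)
print(ablate_refusal_direction(h, d, p=0.5).shape)  # torch.Size([4, 8])
```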

Sources

PromptGuard: An Orchestrated Prompting Framework for Principled Synthetic Text Generation for Vulnerable Populations using LLMs with Enhanced Safety, Fairness, and Controllability

Adversarial robustness through Lipschitz-Guided Stochastic Depth in Neural Networks

Prompt Commons: Collective Prompting as Governance for Urban AI

CIARD: Cyclic Iterative Adversarial Robustness Distillation

Defense-to-Attack: Bypassing Weak Defenses Enables Stronger Jailbreaks in Vision-Language Models

Jailbreaking Large Language Models Through Content Concretization

A Simple and Efficient Jailbreak Method Exploiting LLMs' Helpfulness

LLM Jailbreak Detection for (Almost) Free!

MUSE: MCTS-Driven Red Teaming Framework for Enhanced Multi-Turn Dialogue Safety in Large Language Models

Beyond Surface Alignment: Rebuilding LLMs Safety Mechanism via Probabilistically Ablating Refusal Direction
