Advances in Robustness and Security of Large Language Models

The field of large language models (LLMs) is moving toward stronger robustness and security against adversarial attacks and backdoor injections. Researchers are exploring techniques for certifying LLM robustness, such as randomized smoothing, alongside defenses such as contextual prompt whitelisting for agentic LLMs, and there is growing interest in algorithms that detect and invert backdoor triggers in LLMs. Noteworthy papers in this area include Randomized Smoothing Meets Vision-Language Models, which develops a theory relating the number of noise samples to the corresponding certified robustness radius, making robustness certification both well-defined and computationally feasible for state-of-the-art VLMs, and SABER: Uncovering Vulnerabilities in Safety Alignment via Cross-Layer Residual Connection, which introduces a white-box jailbreak method that achieves a 51% improvement over the best-performing baseline on the HarmBench test set.
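To make the sample-count-to-radius connection concrete, the sketch below shows a generic Cohen-style randomized smoothing certification loop, not the specific method from the cited paper. The `base_classifier` callable is a hypothetical stand-in for any classifier head (e.g., on top of a VLM), and the sampling budget and confidence level are illustrative assumptions: more Monte Carlo samples tighten the lower confidence bound on the top-class probability, which in turn enlarges the certified L2 radius.

```python
# Minimal sketch of randomized-smoothing certification (Cohen et al. style).
# `base_classifier` is a hypothetical callable mapping a (possibly noisy) input
# to a predicted class index; it is not part of the cited paper's code.

import numpy as np
from scipy.stats import beta, norm


def clopper_pearson_lower(successes: int, trials: int, alpha: float = 0.001) -> float:
    """One-sided (1 - alpha) lower confidence bound on a binomial proportion."""
    if successes == 0:
        return 0.0
    return beta.ppf(alpha, successes, trials - successes + 1)


def certify(base_classifier, x: np.ndarray, sigma: float,
            n_samples: int = 1000, alpha: float = 0.001):
    """Return (predicted_class, certified_L2_radius), or (None, 0.0) if abstaining.

    More samples -> tighter lower bound p_A_lower on the top-class probability,
    hence a larger certified radius R = sigma * Phi^{-1}(p_A_lower).
    """
    counts = {}
    for _ in range(n_samples):
        noisy = x + np.random.randn(*x.shape) * sigma  # Gaussian smoothing noise
        c = base_classifier(noisy)
        counts[c] = counts.get(c, 0) + 1

    top_class = max(counts, key=counts.get)
    p_lower = clopper_pearson_lower(counts[top_class], n_samples, alpha)
    if p_lower <= 0.5:
        return None, 0.0  # abstain: not enough evidence to certify
    return top_class, sigma * norm.ppf(p_lower)
```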

Sources

Randomized Smoothing Meets Vision-Language Models

SABER: Uncovering Vulnerabilities in Safety Alignment via Cross-Layer Residual Connection

Inverting Trojans in LLMs

LLMZ+: Contextual Prompt Whitelist Principles for Agentic LLMs

Algorithms for Adversarially Robust Deep Learning

Trigger Where It Hurts: Unveiling Hidden Backdoors through Sensitivity with Sensitron

bi-GRPO: Bidirectional Optimization for Jailbreak Backdoor Injection on LLMs
