Advances in Robustness and Security of Large Language Models

The field of large language models (LLMs) is moving toward stronger robustness and security against adversarial attacks and backdoor injections. Researchers are exploring techniques for certifying LLM robustness, such as randomized smoothing, alongside defenses such as contextual prompt whitelisting for agentic LLMs, and there is growing interest in algorithms that detect and invert backdoor triggers in LLMs. Noteworthy papers in this area include Randomized Smoothing Meets Vision-Language Models, which develops a theory relating the number of noise samples to the corresponding certified robustness radius, making robustness certification both well-defined and computationally feasible for state-of-the-art VLMs, and SABER: Uncovering Vulnerabilities in Safety Alignment via Cross-Layer Residual Connection, which introduces a white-box jailbreak method that achieves a 51% improvement over the best-performing baseline on the HarmBench test set.
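To make the sample-count-to-radius connection concrete, the sketch below shows a generic Cohen-style randomized smoothing certification loop, not the specific method from the cited paper. The `base_classifier` callable is a hypothetical stand-in for any classifier head (e.g., on top of a VLM), and the sampling budget and confidence level are illustrative assumptions: more Monte Carlo samples tighten the lower confidence bound on the top-class probability, which in turn enlarges the certified L2 radius.

```python
# Minimal sketch of randomized-smoothing certification (Cohen et al. style).
# `base_classifier` is a hypothetical callable mapping a (possibly noisy) input
# to a predicted class index; it is not part of the cited paper's code.

import numpy as np
from scipy.stats import beta, norm


def clopper_pearson_lower(successes: int, trials: int, alpha: float = 0.001) -> float:
    """One-sided (1 - alpha) lower confidence bound on a binomial proportion."""
    if successes == 0:
        return 0.0
    return beta.ppf(alpha, successes, trials - successes + 1)


def certify(base_classifier, x: np.ndarray, sigma: float,
            n_samples: int = 1000, alpha: float = 0.001):
    """Return (predicted_class, certified_L2_radius), or (None, 0.0) if abstaining.

    More samples -> tighter lower bound p_A_lower on the top-class probability,
    hence a larger certified radius R = sigma * Phi^{-1}(p_A_lower).
    """
    counts = {}
    for _ in range(n_samples):
        noisy = x + np.random.randn(*x.shape) * sigma  # Gaussian smoothing noise
        c = base_classifier(noisy)
        counts[c] = counts.get(c, 0) + 1

    top_class = max(counts, key=counts.get)
    p_lower = clopper_pearson_lower(counts[top_class], n_samples, alpha)
    if p_lower <= 0.5:
        return None, 0.0  # abstain: not enough evidence to certify
    return top_class, sigma * norm.ppf(p_lower)
```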

Sources

Randomized Smoothing Meets Vision-Language Models

SABER: Uncovering Vulnerabilities in Safety Alignment via Cross-Layer Residual Connection

Inverting Trojans in LLMs

LLMZ+: Contextual Prompt Whitelist Principles for Agentic LLMs

Algorithms for Adversarially Robust Deep Learning

Trigger Where It Hurts: Unveiling Hidden Backdoors through Sensitivity with Sensitron

bi-GRPO: Bidirectional Optimization for Jailbreak Backdoor Injection on LLMs
