Advances in Large Language Model Security

The field of Large Language Models (LLMs) is rapidly evolving, with a growing focus on security and safety. Recent developments have highlighted the vulnerability of LLMs to various types of attacks, including jailbreaks and data extraction attacks. To mitigate these risks, researchers are exploring new techniques such as consistency training, diffusion-based methods, and automated framework for strategy discovery. These approaches aim to improve the robustness of LLMs and prevent malicious prompts from bypassing safety guardrails. Notably, some papers have introduced innovative methods to detect and prevent jailbreaks, such as CPT-Filtering and ShadowLogic. Others have demonstrated the effectiveness of meta-optimization frameworks, like AMIS, in evolving jailbreak prompts and scoring templates. Furthermore, the importance of considering cross-lingual generalization and multi-turn interactions in LLM security has been emphasized. Overall, the field is moving towards developing more robust and secure LLMs that can withstand various types of attacks. Noteworthy papers include: Broken-Token, which introduces CPT-Filtering, a novel technique to detect encoded text. Align to Misalign, which presents AMIS, a meta-optimization framework for evolving jailbreak prompts and scoring templates. AutoAdv, which demonstrates the effectiveness of automated multi-turn jailbreaking. ShadowLogic, which introduces a method for creating backdoors in white-box LLMs.

Sources

Broken-Token: Filtering Obfuscated Prompts by Counting Characters-Per-Token

Consistency Training Helps Stop Sycophancy and Jailbreaks

flowengineR: A Modular and Extensible Framework for Fair and Reproducible Workflow Design in R

Diffusion LLMs are Natural Adversaries for any LLM

Exploiting Latent Space Discontinuities for Building Universal LLM Jailbreaks and Data Extraction Attacks

Red-teaming Activation Probes using Prompted LLMs

ShadowLogic: Backdoors in Any Whitebox LLM

Do Methods to Jailbreak and Defend LLMs Generalize Across Languages?

Align to Misalign: Automatic LLM Jailbreak with Meta-Optimized LLM Judges

An Automated Framework for Strategy Discovery, Retrieval, and Evolution in LLM Jailbreak Attacks

AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models

Black-Box Guardrail Reverse-engineering Attack

Built with on top of