Advances in Large Language Model Security

The field of large language model security is evolving rapidly, with growing attention to anonymizing user-generated text, jailbreaking large language models, and detecting steganographic capabilities. Recent work explores locally deployed smaller-scale language models for anonymization as well as reinforcement learning frameworks for obfuscation-based jailbreak attacks. AgentStealth proposes a self-reinforcing LLM anonymization framework and achieves state-of-the-art anonymization effectiveness and utility. MetaCipher introduces an obfuscation-based jailbreak framework with a reinforcement learning-based dynamic cipher selection mechanism, outperforming existing obfuscation-based jailbreak methods with a 92% attack success rate. VERA contributes a variational inference framework for jailbreaking, and AutoAdv presents a framework for automated adversarial prompting in multi-turn jailbreaking. Together, these results underscore the need for continued research into the security and robustness of large language models.
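To make the anonymization idea concrete, the following minimal sketch shows how a locally deployed model could be used in an adversarial rewrite loop: an attacker model tries to infer personal attributes from the text, and the anonymizer rewrites it until the attack fails. The prompts, the `local_generate` helper, and the stopping rule are illustrative assumptions, not AgentStealth's actual design.

```python
# Hypothetical adversarial anonymization loop in the spirit of AgentStealth:
# a locally deployed "anonymizer" model rewrites user text, an "attacker" model
# tries to infer personal attributes from the rewrite, and the loop repeats
# until inference fails or a round limit is reached. All prompt wording and the
# local_generate interface are illustrative assumptions.

def local_generate(prompt: str) -> str:
    """Placeholder for a call to a locally deployed smaller-scale LLM."""
    raise NotImplementedError("Wire this to your local inference backend.")


def anonymize(text: str, max_rounds: int = 3) -> str:
    current = text
    for _ in range(max_rounds):
        # Adversary step: try to infer private attributes from the current text.
        guess = local_generate(
            "Infer the author's location, age, and occupation from this text, "
            "or answer NONE if nothing can be inferred:\n" + current
        )
        if guess.strip().upper() == "NONE":
            break  # The attacker can no longer recover personal attributes.
        # Anonymizer step: rewrite to remove the cues the attacker exploited,
        # while preserving the original meaning (utility).
        current = local_generate(
            "Rewrite the text so the following attributes can no longer be "
            f"inferred, keeping the meaning intact.\nAttributes: {guess}\n"
            f"Text: {current}"
        )
    return current
```

Similarly, the dynamic cipher selection mentioned above can be pictured as a bandit-style loop over candidate obfuscation schemes. The cipher set, reward signal, and epsilon-greedy policy below are illustrative assumptions, not MetaCipher's actual algorithm.

```python
# Hypothetical reinforcement-learning-driven cipher selection: an epsilon-greedy
# bandit picks an obfuscation cipher for each attempt and updates its value
# estimate from whether the attempt succeeded.

import random

CIPHERS = ["rot13", "base64", "leetspeak", "word_substitution"]  # assumed set


class CipherSelector:
    def __init__(self, epsilon: float = 0.1):
        self.epsilon = epsilon
        self.counts = {c: 0 for c in CIPHERS}
        self.values = {c: 0.0 for c in CIPHERS}  # running mean reward per cipher

    def choose(self) -> str:
        # Explore with probability epsilon, otherwise exploit the best cipher so far.
        if random.random() < self.epsilon:
            return random.choice(CIPHERS)
        return max(self.values, key=self.values.get)

    def update(self, cipher: str, reward: float) -> None:
        # Incremental mean update of the chosen cipher's estimated value.
        self.counts[cipher] += 1
        n = self.counts[cipher]
        self.values[cipher] += (reward - self.values[cipher]) / n
```

In a red-teaming evaluation, `reward` would typically be 1.0 when the obfuscated prompt elicits a disallowed response from the target model and 0.0 otherwise, so the selector gradually concentrates on ciphers the target handles poorly.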

Sources

AgentStealth: Reinforcing Large Language Model for Anonymizing User-generated Text

MetaCipher: A General and Extensible Reinforcement Learning Framework for Obfuscation-Based Jailbreak Attacks on Black-Box LLMs

VERA: Variational Inference Framework for Jailbreaking Large Language Models

AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models

Early Signs of Steganographic Capabilities in Frontier LLMs
