Advances in Large Language Model Safety and Security

The field of large language models (LLMs) is rapidly evolving, with a growing focus on safety and security. Recent research has highlighted the vulnerability of LLMs to jailbreaking attacks, which bypass safety mechanisms and elicit harmful outputs. Work in this area spans both offense and defense: new attack techniques such as latent adversarial reflection, intent concealment and diversion, cross-modal information concealment, and steganographic prompting continue to expose weaknesses, while proposed countermeasures include safety context retrieval, combined continuous and discrete adversarial training, and comprehensive evaluation frameworks. These advances aim to improve the robustness of LLMs against adversarial prompts and to measure their safety more systematically. Noteworthy papers include LARGO, which introduces a latent adversarial reflection attack that surpasses leading jailbreaking techniques, and PandaGuard, a unified framework for systematically evaluating LLM safety. Additionally, research on implicit jailbreak attacks against vision-language models and bit-flip inference cost attacks has revealed new vulnerabilities, underscoring the need for continued innovation in this field.
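To make the evaluation side concrete, the sketch below shows how a basic jailbreak benchmark might score a model: run a set of adversarial prompts, classify each response as a refusal or not, and report an attack success rate. This is a minimal illustration only, assuming a hypothetical query_model interface and a crude keyword-based refusal check; frameworks such as PandaGuard or CARES are far more systematic, and nothing here reflects their actual APIs.

```python
# Minimal sketch of a jailbreak-robustness evaluation loop.
# Illustrative only: `query_model` is a hypothetical stand-in for the
# LLM interface under test, and the refusal check is deliberately naive
# (real evaluations typically use a judge model or human review).
from typing import Callable, Iterable

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry")

def is_refusal(response: str) -> bool:
    """Crude keyword check for whether a response refuses the request."""
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def attack_success_rate(
    adversarial_prompts: Iterable[str],
    query_model: Callable[[str], str],
) -> float:
    """Fraction of adversarial prompts that elicit a non-refusal response."""
    prompts = list(adversarial_prompts)
    if not prompts:
        return 0.0
    successes = sum(1 for p in prompts if not is_refusal(query_model(p)))
    return successes / len(prompts)
```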

Sources

LARGO: Latent Adversarial Reflection through Gradient Optimization for Jailbreaking LLMs

Let the Trial Begin: A Mock-Court Approach to Vulnerability Detection using LLM-Based Agents

CARES: Comprehensive Evaluation of Safety and Adversarial Robustness in Medical LLMs

LLMs unlock new paths to monetizing exploits

ProxyPrompt: Securing System Prompts against Prompt Extraction Attacks

Logic Jailbreak: Efficiently Unlocking LLM Safety Restrictions Through Formal Logical Expression

Fragments to Facts: Partial-Information Fragment Inference from LLMs

PandaGuard: Systematic Evaluation of LLM Safety in the Era of Jailbreaking Attacks

Safety Devolution in AI Agents

Exploring Jailbreak Attacks on LLMs through Intent Concealment and Diversion

Lessons from Defending Gemini Against Indirect Prompt Injections

Soft Prompts for Evaluation: Measuring Conditional Distance of Capabilities

BountyBench: Dollar Impact of AI Agent Attackers and Defenders on Real-World Cybersecurity Systems

Towards Zero-Shot Differential Morphing Attack Detection with Multimodal Large Language Models

Alignment Under Pressure: The Case for Informed Adversaries When Evaluating LLM Defenses

Scalable Defense against In-the-wild Jailbreaking Attacks with Safety Context Retrieval

Implicit Jailbreak Attacks via Cross-Modal Information Concealment on Vision-Language Models

BitHydra: Towards Bit-flip Inference Cost Attack against Large Language Models

When Safety Detectors Aren't Enough: A Stealthy and Effective Jailbreak Attack on LLMs via Steganographic Techniques

MixAT: Combining Continuous and Discrete Adversarial Training for LLMs