Advances in Large Language Model Safety and Security

The field of Large Language Models (LLMs) is rapidly evolving, with a growing focus on safety and security. Recent research has highlighted the vulnerabilities of LLMs to various types of attacks, including jailbreak attacks, cross-user poisoning, and adversarial attacks. These attacks can be used to bypass safety filters, generate harmful content, and exploit user interactions. To address these challenges, researchers are developing new defense strategies, such as prompt-level defenses, model-level defenses, and training-time interventions. Noteworthy papers in this area include 'Evaluating Adversarial Vulnerabilities in Modern Large Language Models' and 'Defending Large Language Models Against Jailbreak Exploits with Responsible AI Considerations', which demonstrate the effectiveness of these defense strategies in mitigating the risks associated with LLMs. Overall, the field is moving towards a more robust and secure deployment of LLMs, with a focus on responsible AI development and deployment.

Sources

Evaluating Adversarial Vulnerabilities in Modern Large Language Models

MURMUR: Using cross-user chatter to break collaborative language agents in groups

Shadows in the Code: Exploring the Risks and Defenses of LLM-based Multi-Agent Software Development Systems

TASO: Jailbreak LLMs via Alternative Template and Suffix Optimization

RoguePrompt: Dual-Layer Ciphering for Self-Reconstruction to Circumvent LLM Moderation

Defending Large Language Models Against Jailbreak Exploits with Responsible AI Considerations

Understanding and Mitigating Over-refusal for Large Language Models via Safety Representation

Can LLMs Threaten Human Survival? Benchmarking Potential Existential Threats from LLMs via Prefix Completion

Adversarial Attack-Defense Co-Evolution for LLM Safety Alignment via Tree-Group Dual-Aware Search and Optimization

Building Resilient Information Ecosystems: Large LLM-Generated Dataset of Persuasion Attacks

Automating Deception: Scalable Multi-Turn LLM Jailbreaks

An Invariant Latent Space Perspective on Language Model Inversion

Prompt Fencing: A Cryptographic Approach to Establishing Security Boundaries in Large Language Model Prompts

Cross-LLM Generalization of Behavioral Backdoor Detection in AI Agent Supply Chains

PropensityBench: Evaluating Latent Safety Risks in Large Language Models via an Agentic Approach

Large Language Models' Complicit Responses to Illicit Instructions across Socio-Legal Contexts

Semantic Superiority vs. Forensic Efficiency: A Comparative Analysis of Deep Learning and Psycholinguistics for Business Email Compromise Detection

Self-Guided Defense: Adaptive Safety Alignment for Reasoning Models via Synthesized Guidelines

Constructing and Benchmarking: a Labeled Email Dataset for Text-Based Phishing and Spam Detection Framework