Advances in Large Language Model Safety and Security

The field of large language models (LLMs) is rapidly evolving, with a growing focus on safety and security. Recent developments have highlighted the vulnerability of LLMs to various types of attacks, including jailbreaks, prompt injection, and data poisoning. To address these concerns, researchers have proposed innovative defense mechanisms, such as contextual integrity verification, latent fusion jailbreak detection, and safe-completion training. These approaches aim to improve the robustness of LLMs against adversarial attacks while maintaining their helpfulness and performance. Noteworthy papers in this area include those that propose novel evaluation frameworks for jailbreak attacks, develop more effective attack methods, and investigate the role of context filtering in maintaining safe alignment of LLMs. Overall, the field is moving towards a more comprehensive understanding of LLM safety and security, with a focus on developing practical and effective solutions to mitigate potential risks.

Sources

Universally Unfiltered and Unseen:Input-Agnostic Multimodal Jailbreaks against Text-to-Image Model Safeguards

DINA: A Dual Defense Framework Against Internal Noise and External Attacks in Natural Language Processing

Guardians and Offenders: A Survey on Harmful Content Generation and Safety Mitigation

SLIP: Soft Label Mechanism and Key-Extraction-Guided CoT-based Defense Against Instruction Backdoor in APIs

Beyond Uniform Criteria: Scenario-Adaptive Multi-Dimensional Jailbreak Evaluation

LLM Robustness Leaderboard v1 --Technical report

Quantifying Conversation Drift in MCP via Latent Polytope

Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs

LLM Unlearning Without an Expert Curated Dataset

Many-Turn Jailbreaking

Towards Effective Prompt Stealing Attack against Text-to-Image Diffusion Models

Who's the Evil Twin? Differential Auditing for Undesired Behavior

A Real-Time, Self-Tuning Moderator Framework for Adversarial Prompt Detection

Multi-Turn Jailbreaks Are Simpler Than They Seem

Securing Educational LLMs: A Generalised Taxonomy of Attacks on LLMs and DREAD Risk Assessment

Towards Effective MLLM Jailbreaking Through Balanced On-Topicness and OOD-Intensity

From Hard Refusals to Safe-Completions: Toward Output-Centric Safety Training

Can AI Keep a Secret? Contextual Integrity Verification: A Provable Security Architecture for LLMs

An Audit and Analysis of LLM-Assisted Health Misinformation Jailbreaks Against LLMs

Latent Fusion Jailbreak: Blending Harmful and Harmless Representations to Elicit Unsafe LLM Outputs

Context Misleads LLMs: The Role of Context Filtering in Maintaining Safe Alignment of LLMs

The Cost of Thinking: Increased Jailbreak Risk in Large Language Models

Jailbreaking Commercial Black-Box LLMs with Explicitly Harmful Prompts