Advances in Large Language Model Security

The field of large language model security is moving toward stronger robustness against attacks such as jailbreaks and backdoors. Researchers are exploring methods to mitigate these threats, including integrating expert models, applying knowledge-based approaches, and devising novel attacks to probe model vulnerabilities. Benchmarks and frameworks for evaluating and comparing model safety are also becoming increasingly important, pushing the field toward more secure and reliable large language models.

Noteworthy papers include CRAKEN, a knowledge-based LLM agent framework that improves cybersecurity capability through contextual decomposition and knowledge-hint injection; JALMBench, the first comprehensive benchmark for assessing the safety of audio language models against jailbreak attacks; and One Model Transfer to All, which proposes ArrAttack, an attack method that generates robust jailbreak prompts capable of bypassing a variety of defenses.

Sources

Improving LLM Outputs Against Jailbreak Attacks with Expert Model Integration

CRAKEN: Cybersecurity LLM Agent with Knowledge-Based Execution

Backdoors in DRL: Four Environments Focusing on In-distribution Triggers

SweEval: Do LLMs Really Swear? A Safety Benchmark for Testing Limits for Enterprise Use

Chain-of-Lure: A Synthetic Narrative-Driven Approach to Compromise Large Language Models

JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models

One Model Transfer to All: On Robust Jailbreak Prompts Generation against LLMs

Wolf Hidden in Sheep's Conversations: Toward Harmless Data-Based Backdoor Attacks for Jailbreaking Large Language Models

Does Chain-of-Thought Reasoning Really Reduce Harmfulness from Jailbreaking?

How Can I Publish My LLM Benchmark Without Giving the True Answers Away?

Jailbreak Distillation: Renewable Safety Benchmarking

Test-Time Immunization: A Universal Defense Framework Against Jailbreaks for (Multimodal) Large Language Models

First Steps Towards Overhearing LLM Agents: A Case Study With Dungeons & Dragons Gameplay

DyePack: Provably Flagging Test Set Contamination in LLMs Using Backdoors

Adaptive Jailbreaking Strategies Based on the Semantic Understanding Capabilities of Large Language Models

Merge Hijacking: Backdoor Attacks to Model Merging of Large Language Models

Model Immunization from a Condition Number Perspective