Advances in Large Language Model Security

The field of Large Language Models (LLMs) is rapidly evolving, with a growing focus on security and safety. Recent research has highlighted the vulnerability of LLMs to various attacks, including backdoor attacks, jailbreak attacks, and malicious prompt injections. These attacks can compromise the integrity of LLMs, allowing malicious actors to manipulate their outputs and potentially causing harm. In response, researchers are developing innovative defense mechanisms, such as single-token sentinels and iterative prompting techniques, to detect and prevent these attacks. These advances demonstrate the ongoing efforts to improve the security and reliability of LLMs. Notable papers in this area include:

  • STShield, which introduces a lightweight framework for real-time jailbroken judgement,
  • Smoke and Mirrors, which presents a novel jailbreaking approach using implicit malicious prompts,
  • and Prompt, Divide, and Conquer, which employs distributed prompt processing to bypass safety filters.

Sources

Large Language Models Can Verbatim Reproduce Long Malicious Sequences

STShield: Single-Token Sentinel for Real-Time Jailbreak Detection in Large Language Models

Smoke and Mirrors: Jailbreaking LLM-based Code Generation via Implicit Malicious Prompts

sudo rm -rf agentic_security

Iterative Prompting with Persuasion Skills in Jailbreaking Large Language Models

Prompt, Divide, and Conquer: Bypassing Large Language Model Safety Filters via Segmented and Distributed Prompt Processing

Built with on top of