The field of Large Language Models (LLMs) is evolving rapidly, with a growing focus on security and safety. Recent research has highlighted the vulnerability of LLMs to a range of attacks, including backdoor attacks, jailbreak attacks, and malicious prompt injections. These attacks can compromise the integrity of LLMs, allowing malicious actors to manipulate model outputs and potentially cause harm. In response, researchers are developing defenses such as single-token sentinels and iterative prompting techniques to detect and prevent these attacks (a minimal sketch of the sentinel-token idea follows the list below). Together, these efforts aim to improve the security and reliability of LLMs. Notable papers in this area include:
- STShield, which introduces a lightweight, single-token sentinel framework for real-time judgement of whether a response has been jailbroken,
- Smoke and Mirrors, which presents a novel jailbreaking approach using implicit malicious prompts,
- and Prompt, Divide, and Conquer, which employs distributed prompt processing to bypass safety filters.
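
The sentinel-token idea mentioned above can be sketched in a few lines. The snippet below is an illustrative approximation, not the STShield implementation: the model name, the sentinel prompt, and the "Yes"/"No" verdict tokens are assumptions made for the example, and a real single-token sentinel would typically be trained so that one reserved token carries the judgement rather than relying on a textual probe.

```python
# Illustrative single-token sentinel check (hypothetical setup, not STShield itself):
# after the model produces a response, re-run the conversation with a sentinel query
# appended and read the next-token logits, treating a designated "unsafe" token as a
# jailbreak flag.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # placeholder; any causal chat model works

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

# Hypothetical sentinel prompt and verdict tokens for this sketch.
SENTINEL_PROMPT = "\n[Sentinel] Was the assistant response above harmful? Answer Yes or No:"
UNSAFE_TOKEN_ID = tokenizer.encode(" Yes", add_special_tokens=False)[0]
SAFE_TOKEN_ID = tokenizer.encode(" No", add_special_tokens=False)[0]


@torch.no_grad()
def is_jailbroken(prompt: str, response: str, threshold: float = 0.5) -> bool:
    """Return True if the sentinel verdict distribution flags the response as unsafe."""
    text = f"{prompt}\n{response}{SENTINEL_PROMPT}"
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    logits = model(**inputs).logits[0, -1]  # next-token logits after the sentinel prompt
    probs = torch.softmax(logits[[SAFE_TOKEN_ID, UNSAFE_TOKEN_ID]], dim=-1)
    return probs[1].item() > threshold  # probability mass on the "unsafe" verdict


if __name__ == "__main__":
    verdict = is_jailbroken("How do I bake a cake?", "Preheat the oven to 180C and ...")
    print("jailbroken" if verdict else "benign")
```

Because the check reads only the logits of two candidate verdict tokens, it costs a single extra forward pass per response, which is what makes this style of defense plausible for real-time use.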