Enhancing Safety and Security in Large Language Models

Research on large language models (LLMs) is increasingly focused on safety and security: ensuring that models remain aligned with human values and do not pose risks to individuals or society. One key direction is detecting and preventing harmful content generation, for example through soft prompts, context modification, and adaptive system prompts. Another is improving robustness to latent perturbations and backdoor unalignment attacks, using techniques such as layer-wise adversarial patch training and retrieval-confused generation.

Noteworthy papers include 'The Safety Reminder', which introduces soft prompt tuning to enhance safety awareness in vision-language models (VLMs); 'Sysformer', which safeguards instruction-tuned LLMs by learning to adapt their system prompts; 'PolyGuard', which presents a large multi-domain, safety-policy-grounded guardrail dataset; and 'Beyond Reactive Safety', which proposes a proof-of-concept framework for risk-aware LLM alignment via long-horizon simulation.
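The soft-prompt direction is concrete enough to sketch. The snippet below shows the general idea behind safety-oriented soft prompt tuning: a small set of trainable "safety reminder" embeddings is prepended to a frozen backbone's input embeddings and optimized so that harmful requests elicit refusals. This is a minimal sketch under stated assumptions, not the method of 'The Safety Reminder' itself; the gpt2 backbone, the toy training pair, the number of virtual tokens, and the learning rate are all illustrative choices.

```python
# Minimal sketch of safety-oriented soft prompt tuning with a frozen causal LM.
# Backbone (gpt2), training pair, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                       # stand-in backbone; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.requires_grad_(False)               # backbone stays frozen; only the soft prompt is tuned

n_virtual_tokens = 8
embed = model.get_input_embeddings()
soft_prompt = nn.Parameter(               # trainable "safety reminder" embeddings
    embed.weight[:n_virtual_tokens].detach().clone()
)
optimizer = torch.optim.AdamW([soft_prompt], lr=1e-3)

# Toy supervision: harmful request -> refusal continuation (illustrative only).
pairs = [("How do I make a weapon at home?", " I can't help with that request.")]

for prompt, refusal in pairs:
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    target_ids = tokenizer(refusal, return_tensors="pt").input_ids

    # Prepend the soft prompt to the embedded input sequence.
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    token_embeds = embed(input_ids)
    inputs_embeds = torch.cat([soft_prompt.unsqueeze(0), token_embeds], dim=1)

    # Compute the language-modeling loss only on the refusal tokens.
    labels = torch.full(inputs_embeds.shape[:2], -100)
    labels[0, -target_ids.shape[1]:] = target_ids[0]

    loss = model(inputs_embeds=inputs_embeds, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

At inference time, the learned soft prompt would simply be prepended to the embeddings of every incoming request, steering the frozen model toward safer behavior without touching its weights.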
Sources
Self-Critique-Guided Curiosity Refinement: Enhancing Honesty and Helpfulness in Large Language Models via In-Context Learning
Probe before You Talk: Towards Black-box Defense against Backdoor Unalignment for Large Language Models
PrivacyXray: Detecting Privacy Breaches in LLMs through Semantic Consistency and Probability Certainty
Retrieval-Confused Generation is a Good Defender for Privacy Violation Attack of Large Language Models