Enhancing Safety and Security in Large Language Models

Research on large language models (LLMs) is increasingly focused on safety and security: ensuring that models remain aligned with human values and do not pose risks to individuals or society. One key direction is detecting and preventing harmful content generation, for example through soft prompts, context modification, and adaptive system prompts. Another is improving robustness to latent perturbations and backdoor unalignment attacks, using techniques such as layer-wise adversarial patch training and retrieval-confused generation.

Noteworthy papers include 'The Safety Reminder', which introduces soft prompt tuning to enhance safety awareness in vision-language models (VLMs); 'Sysformer', which safeguards instruction-tuned LLMs by learning to adapt their system prompts; 'PolyGuard', which presents a large multi-domain, safety-policy-grounded guardrail dataset; and 'Beyond Reactive Safety', which proposes a proof-of-concept framework for risk-aware LLM alignment via long-horizon simulation.
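The soft-prompt direction is concrete enough to sketch. The snippet below shows the general idea behind safety-oriented soft prompt tuning: a small set of trainable "safety reminder" embeddings is prepended to a frozen backbone's input embeddings and optimized so that harmful requests elicit refusals. This is a minimal sketch under stated assumptions, not the method of 'The Safety Reminder' itself; the gpt2 backbone, the toy training pair, the number of virtual tokens, and the learning rate are all illustrative choices.

```python
# Minimal sketch of safety-oriented soft prompt tuning with a frozen causal LM.
# Backbone (gpt2), training pair, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                       # stand-in backbone; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.requires_grad_(False)               # backbone stays frozen; only the soft prompt is tuned

n_virtual_tokens = 8
embed = model.get_input_embeddings()
soft_prompt = nn.Parameter(               # trainable "safety reminder" embeddings
    embed.weight[:n_virtual_tokens].detach().clone()
)
optimizer = torch.optim.AdamW([soft_prompt], lr=1e-3)

# Toy supervision: harmful request -> refusal continuation (illustrative only).
pairs = [("How do I make a weapon at home?", " I can't help with that request.")]

for prompt, refusal in pairs:
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    target_ids = tokenizer(refusal, return_tensors="pt").input_ids

    # Prepend the soft prompt to the embedded input sequence.
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    token_embeds = embed(input_ids)
    inputs_embeds = torch.cat([soft_prompt.unsqueeze(0), token_embeds], dim=1)

    # Compute the language-modeling loss only on the refusal tokens.
    labels = torch.full(inputs_embeds.shape[:2], -100)
    labels[0, -target_ids.shape[1]:] = target_ids[0]

    loss = model(inputs_embeds=inputs_embeds, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

At inference time, the learned soft prompt would simply be prepended to the embeddings of every incoming request, steering the frozen model toward safer behavior without touching its weights.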
Sources
Self-Critique-Guided Curiosity Refinement: Enhancing Honesty and Helpfulness in Large Language Models via In-Context Learning
Probe before You Talk: Towards Black-box Defense against Backdoor Unalignment for Large Language Models
PrivacyXray: Detecting Privacy Breaches in LLMs through Semantic Consistency and Probability Certainty
Retrieval-Confused Generation is a Good Defender for Privacy Violation Attack of Large Language Models