Advancements in AI Safety and Security

The field of AI safety and security is evolving rapidly, with growing attention to concrete threats against large language models: data poisoning and backdoor attacks, emergent deceptive behavior, and the fragility of existing safety alignment mechanisms.

On deception, new detection and mitigation frameworks such as PU-Lie, which applies positive-unlabeled learning to imbalanced diplomatic dialogues, and Adversarial Activation Patching have shown promising results in surfacing subtle deceptive behavior in safety-aligned transformers (the first sketch below illustrates the basic patching mechanic). At the embedding level, techniques such as ETTA show that safety alignment can be circumvented by identifying and attenuating toxicity-sensitive dimensions, while LaSM's layer-wise scaling mechanism improves defense success rates against pop-up-based environmental injection attacks on GUI agents.

On tooling, MT4DP applies metamorphic testing to detect data poisoning attacks on deep-learning-based code search models, and the Safety Gap Toolkit evaluates how easily safeguards can be stripped from open-source models, highlighting the hidden dangers of open-source releases. The position paper on chain-of-thought monitorability adds that this transparency is a new but fragile safety opportunity: development decisions can erode it, so it warrants dedicated research and investment in CoT monitoring alongside existing safety methods.

Noteworthy papers include 'Circumventing Safety Alignment in Large Language Models Through Embedding Space Toxicity Attenuation', which proposes a framework for identifying and attenuating toxicity-sensitive dimensions in embedding space (the second sketch below gives a minimal illustration of the direction-attenuation idea), and 'ARMOR: Aligning Secure and Safe Large Language Models via Meticulous Reasoning', which introduces a reasoning-based alignment framework for building secure and safe models.
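
To make the activation-patching mechanic concrete, here is a minimal, self-contained sketch: it caches a hidden activation from a "clean" forward pass, splices it into a run on a "corrupted" input at the same layer, and compares the resulting outputs. The toy two-layer model, the layer names, and the honest/deceptive logit labels are hypothetical illustrations, not the setup from the Adversarial Activation Patching paper.

```python
# Minimal activation-patching sketch (illustrative only).
import torch
import torch.nn as nn

torch.manual_seed(0)

class ToyModel(nn.Module):
    """Hypothetical two-layer stand-in for a stack of transformer blocks."""
    def __init__(self, d=16):
        super().__init__()
        self.layer1 = nn.Linear(d, d)
        self.layer2 = nn.Linear(d, 2)  # 2 logits, e.g. "honest" vs "deceptive"

    def forward(self, x):
        h = torch.relu(self.layer1(x))
        return self.layer2(h)

model = ToyModel()
clean_input = torch.randn(1, 16)    # stand-in for a benign prompt's embedding
corrupt_input = torch.randn(1, 16)  # stand-in for a deception-inducing prompt

# 1) Cache layer1's activation on the clean input via a forward hook.
cache = {}
def save_hook(module, inputs, output):
    cache["layer1"] = output.detach()

handle = model.layer1.register_forward_hook(save_hook)
clean_logits = model(clean_input)
handle.remove()

# 2) Re-run on the corrupted input, patching in the cached clean activation.
#    Returning a tensor from a forward hook replaces the module's output.
def patch_hook(module, inputs, output):
    return cache["layer1"]

handle = model.layer1.register_forward_hook(patch_hook)
patched_logits = model(corrupt_input)
handle.remove()

baseline_logits = model(corrupt_input)  # corrupted run, no patch

print("clean  :", clean_logits.detach())
print("corrupt:", baseline_logits.detach())
print("patched:", patched_logits.detach())
# If patching this layer moves the corrupted output toward the clean output,
# the layer's activation is implicated in the behavior being probed.
```

In practice the same hook pattern is applied to real transformer blocks, sweeping over layers and token positions to localize where a behavior is represented.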

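For the embedding-space idea, one common way to operationalize a "toxicity-sensitive" direction is a difference of means between toxic and benign embeddings, which can then be dampened or projected out. The sketch below illustrates that general recipe with random placeholder vectors; the dimension size, the `attenuate` helper, and the difference-of-means estimator are assumptions for illustration, and the actual procedure in the ETTA paper may differ.

```python
# Minimal direction-attenuation sketch (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
d = 64  # hypothetical embedding dimension

# Placeholder embeddings standing in for real model hidden states.
toxic_embeddings = rng.normal(size=(100, d)) + 0.5   # artificially shifted cluster
benign_embeddings = rng.normal(size=(100, d))

# 1) Estimate a toxicity-sensitive direction as a normalized difference of means.
direction = toxic_embeddings.mean(axis=0) - benign_embeddings.mean(axis=0)
direction /= np.linalg.norm(direction)

# 2) Attenuate the component of an embedding along that direction.
def attenuate(x, direction, alpha=1.0):
    """alpha=1.0 removes the component entirely; 0 < alpha < 1 only dampens it."""
    return x - alpha * np.outer(x @ direction, direction)

x = rng.normal(size=(1, d)) + 0.5  # a hypothetical toxic-leaning embedding
x_attenuated = attenuate(x, direction)

print("component along direction, before:", (x @ direction).item())
print("component along direction, after :", (x_attenuated @ direction).item())
```

On a real model, the embeddings would be collected from a chosen layer's hidden states rather than sampled at random.
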
Sources

Circumventing Safety Alignment in Large Language Models Through Embedding Space Toxicity Attenuation

Audit, Alignment, and Optimization of LM-Powered Subroutines with Application to Public Comment Processing

Giving AI Agents Access to Cryptocurrency and Smart Contracts Creates New Vectors of AI Harm

Exploiting Leaderboards for Large-Scale Distribution of Malicious Models

PU-Lie: Lightweight Deception Detection in Imbalanced Diplomatic Dialogues via Positive-Unlabeled Learning

Adversarial Activation Patching: A Framework for Detecting and Mitigating Emergent Deception in Safety-Aligned Transformers

LaSM: Layer-wise Scaling Mechanism for Defending Pop-up Attack on GUI Agents

MT4DP: Data Poisoning Attack Detection for DL-based Code Search Models via Metamorphic Testing

The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs

Multi-Trigger Poisoning Amplifies Backdoor Vulnerabilities in LLMs

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

ARMOR: Aligning Secure and Safe Large Language Models via Meticulous Reasoning

The Safety Gap Toolkit: Evaluating Hidden Dangers of Open-Source Models

LLMs Encode Harmfulness and Refusal Separately

Thought Purity: Defense Paradigm For Chain-of-Thought Attack

Benchmarking Deception Probes via Black-to-White Performance Boosts

Manipulation Attacks by Misaligned AI: Risk Analysis and Safety Case Framework
