Explainability and Security in AI-Assisted Decision Making

The field of AI-assisted decision making is moving toward greater transparency and trustworthiness, with a focus on explainability and security. Recent work has highlighted the importance of evaluating and improving the robustness of Class Activation Maps (CAMs) and other explainability methods against noise and adversarial attacks. At the same time, the rise of Large Language Models (LLMs) has introduced new security threats, such as hidden prompt injection attacks, which can manipulate model outputs without user awareness or any system compromise. Researchers are developing principled approaches to detect and mitigate these threats, including robustness metrics and safe machine learning techniques. Noteworthy papers in this area include PhantomLint, which presents a principled approach to detecting hidden LLM prompts in structured documents; Attacking LLMs and AI Agents, which introduces Advertisement Embedding Attacks as a new class of LLM security threats; and Safer Skin Lesion Classification with Global Class Activation Probability Map Evaluation and SafeML, which proposes a method for evaluating and improving the reliability of skin lesion classification models.
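To make the idea of CAM noise robustness concrete, the sketch below shows one simple way such a metric could be computed: perturb the input with Gaussian noise and measure how much the saliency map changes. This is a minimal illustration, not the specific evaluation protocol used in the cited papers; the `saliency_fn` callable, the noise level `sigma`, and the Pearson-correlation score are all assumptions chosen for clarity.

```python
import numpy as np

def cam_noise_robustness(saliency_fn, image, sigma=0.05, n_trials=10, seed=0):
    """Estimate how stable a CAM/saliency map is under Gaussian input noise.

    saliency_fn: hypothetical callable mapping an image array to a 2-D
                 saliency map (e.g. a wrapper around any CAM method).
    Returns the mean Pearson correlation between the map for the clean
    input and the maps for noise-perturbed copies (1.0 = perfectly stable).
    """
    rng = np.random.default_rng(seed)
    clean_map = saliency_fn(image).ravel()
    scores = []
    for _ in range(n_trials):
        # Perturb the input with zero-mean Gaussian noise
        noisy = image + rng.normal(0.0, sigma, size=image.shape)
        noisy_map = saliency_fn(noisy).ravel()
        # Correlation between the flattened clean and noisy saliency maps
        scores.append(np.corrcoef(clean_map, noisy_map)[0, 1])
    return float(np.mean(scores))
```

A score close to 1.0 suggests the explanation is stable under small input perturbations, while lower values flag explanations that may not be trustworthy for decision support.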

Sources

Benchmarking Class Activation Map Methods for Explainable Brain Hemorrhage Classification on Hemorica Dataset

PhantomLint: Principled Detection of Hidden LLM Prompts in Structured Documents

Attacking LLMs and AI Agents: Advertisement Embedding Attacks Against Large Language Models

Assessing the Noise Robustness of Class Activation Maps: A Framework for Reliable Model Interpretability

Prompt-in-Content Attacks: Exploiting Uploaded Inputs to Hijack LLM Behavior

Safer Skin Lesion Classification with Global Class Activation Probability Map Evaluation and SafeML

Publish to Perish: Prompt Injection Attacks on LLM-Assisted Peer Review
