Advances in Safe and Explainable Language Models

The field is moving toward safer and more explainable language models. Recent research focuses on defending against harmful prompts and attacks, such as jailbreak and backdoor attacks, alongside a growing interest in explainability methods that provide insight into model behavior and decision-making. These methods aim to identify the concepts and semantics that influence model outputs, enabling more transparent and trustworthy language models. Noteworthy papers in this area include LiteLMGuard, which proposes lightweight on-device prompt filtering to safeguard small language models; ConceptX, which introduces concept-level explainability for auditing and steering language model responses; Dialz, a Python toolkit for steering vectors; and Adversarial Suffix Filtering, a defense pipeline for large language models.
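To make the steering-vector idea concrete, the snippet below is a minimal sketch of activation steering: a direction is derived from a contrastive pair of prompts and added to a hidden layer's activations during generation. The model name (gpt2), layer index, scaling factor, and prompts are illustrative assumptions; this does not show the actual APIs of Dialz or ConceptX.

```python
# Minimal activation-steering sketch (assumptions: gpt2, layer 6, toy prompts).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any causal LM with accessible hidden states
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

LAYER = 6  # assumption: steer at a middle transformer block

def mean_hidden(prompt: str) -> torch.Tensor:
    """Mean hidden state of the prompt tokens at the chosen layer."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids)
    return out.hidden_states[LAYER].mean(dim=1).squeeze(0)

# Build a steering direction from a contrastive pair of prompts.
steer = mean_hidden("Respond politely and helpfully.") - \
        mean_hidden("Respond rudely and dismissively.")
steer = steer / steer.norm()

def add_steering(module, inputs, output, alpha=4.0):
    """Forward hook: shift the block's hidden states along the steering direction."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + alpha * steer.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
prompt_ids = tok("The customer asked a question, and the assistant said", return_tensors="pt")
print(tok.decode(model.generate(**prompt_ids, max_new_tokens=30)[0]))
handle.remove()  # restore unsteered behavior
```

The same pattern, with a concept-aligned direction instead of a hand-built contrastive one, is what concept-level steering methods build on.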

Sources

LiteLMGuard: Seamless and Lightweight On-Device Prompt Filtering for Safeguarding Small Language Models against Quantization-induced Risks and Vulnerabilities

Dialz: A Python Toolkit for Steering Vectors

Natural Reflection Backdoor Attack on Vision Language Model for Autonomous Driving

Discovering Concept Directions from Diffusion-based Counterfactuals via Latent Clustering

One Trigger Token Is Enough: A Defense Strategy for Balancing Safety and Usability in Large Language Models

Concept-Level Explainability for Auditing & Steering LLM Responses

Unpacking Robustness in Inflectional Languages: Adversarial Evaluation and Mechanistic Insights

Adversarial Suffix Filtering: a Defense Pipeline for LLMs
