Advances in Safe and Explainable Language Models

The field is moving toward safer and more explainable language models. Recent research focuses on defending against harmful prompts and attacks, such as jailbreak and backdoor attacks, alongside a growing interest in explainability methods that provide insight into model behavior and decision-making. These methods aim to identify the concepts and semantics that influence model outputs, enabling more transparent and trustworthy language models. Noteworthy papers in this area include LiteLMGuard, which proposes lightweight on-device prompt filtering to safeguard small language models; ConceptX, which introduces concept-level explainability for auditing and steering language model responses; Dialz, a Python toolkit for steering vectors; and Adversarial Suffix Filtering, a defense pipeline for large language models.
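To make the steering-vector idea concrete, the snippet below is a minimal sketch of activation steering: a direction is derived from a contrastive pair of prompts and added to a hidden layer's activations during generation. The model name (gpt2), layer index, scaling factor, and prompts are illustrative assumptions; this does not show the actual APIs of Dialz or ConceptX.

```python
# Minimal activation-steering sketch (assumptions: gpt2, layer 6, toy prompts).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any causal LM with accessible hidden states
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

LAYER = 6  # assumption: steer at a middle transformer block

def mean_hidden(prompt: str) -> torch.Tensor:
    """Mean hidden state of the prompt tokens at the chosen layer."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids)
    return out.hidden_states[LAYER].mean(dim=1).squeeze(0)

# Build a steering direction from a contrastive pair of prompts.
steer = mean_hidden("Respond politely and helpfully.") - \
        mean_hidden("Respond rudely and dismissively.")
steer = steer / steer.norm()

def add_steering(module, inputs, output, alpha=4.0):
    """Forward hook: shift the block's hidden states along the steering direction."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + alpha * steer.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
prompt_ids = tok("The customer asked a question, and the assistant said", return_tensors="pt")
print(tok.decode(model.generate(**prompt_ids, max_new_tokens=30)[0]))
handle.remove()  # restore unsteered behavior
```

The same pattern, with a concept-aligned direction instead of a hand-built contrastive one, is what concept-level steering methods build on.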

Sources

LiteLMGuard: Seamless and Lightweight On-Device Prompt Filtering for Safeguarding Small Language Models against Quantization-induced Risks and Vulnerabilities

Dialz: A Python Toolkit for Steering Vectors

Natural Reflection Backdoor Attack on Vision Language Model for Autonomous Driving

Discovering Concept Directions from Diffusion-based Counterfactuals via Latent Clustering

One Trigger Token Is Enough: A Defense Strategy for Balancing Safety and Usability in Large Language Models

Concept-Level Explainability for Auditing & Steering LLM Responses

Unpacking Robustness in Inflectional Languages: Adversarial Evaluation and Mechanistic Insights

Adversarial Suffix Filtering: a Defense Pipeline for LLMs
