Advances in Backdoor Defense and Mechanistic Interpretability of Large Language Models

The field of large language models (LLMs) is seeing rapid progress in backdoor defense and mechanistic interpretability. Researchers are developing methods to detect and mitigate backdoor attacks, which threaten the integrity of deployed LLMs, with recent studies analyzing attention patterns, sample selection strategies, and pruning techniques to harden models against such attacks. In parallel, mechanistic interpretability is being used to probe the internal workings of LLMs, including how they represent binary truth values and carry out logical reasoning. Noteworthy papers in this area include Mechanistic Exploration of Backdoored Large Language Model Attention Patterns, which finds distinct attention-pattern deviations concentrated in the later transformer layers of backdoored models, and Lethe: Purifying Backdoored Large Language Models with Knowledge Dilution, which removes backdoor behaviors by diluting the poisoned knowledge through both internal and external mechanisms.
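
To make the attention-pattern idea concrete, the sketch below shows one simple way to compare per-layer attention statistics between a clean prompt and a trigger-bearing variant using Hugging Face Transformers. This is not the method of any paper listed here: the checkpoint name "gpt2", the toy trigger token "cf", and the use of attention entropy as the comparison statistic are all illustrative assumptions, and the cited work studies backdoored models fine-tuned on poisoned data rather than off-the-shelf checkpoints.

```python
# Minimal sketch: compare per-layer attention entropy on a clean prompt vs. a
# trigger-bearing variant. "gpt2" and the token "cf" are illustrative
# assumptions; the cited papers analyze backdoored, fine-tuned LLMs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in checkpoint, not an actual backdoored model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def per_layer_attention_entropy(text: str) -> torch.Tensor:
    """Mean attention entropy per layer; lower values mean sharper attention."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_attentions=True)
    entropies = []
    for attn in out.attentions:  # one (batch, heads, seq, seq) tensor per layer
        probs = attn.clamp_min(1e-12)  # avoid log(0) on causally masked entries
        entropy = -(probs * probs.log()).sum(dim=-1).mean()  # avg over heads/positions
        entropies.append(entropy)
    return torch.stack(entropies)

clean = per_layer_attention_entropy("The movie was wonderful and I enjoyed it.")
triggered = per_layer_attention_entropy("The movie was wonderful cf and I enjoyed it.")
for layer, delta in enumerate(triggered - clean):
    print(f"layer {layer:2d}: entropy shift {delta.item():+.4f}")
```

In a genuinely backdoored model, the finding reported above suggests such deviations would concentrate in the later layers, which is the kind of per-layer signal an attention-based or pruning-based defense could monitor.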

Sources

Mechanistic Exploration of Backdoored Large Language Model Attention Patterns

Strategic Sample Selection for Improved Clean-Label Backdoor Attacks in Text Classification

From Indirect Object Identification to Syllogisms: Exploring Binary Mechanisms in Transformer Circuits

Even Heads Fix Odd Errors: Mechanistic Discovery and Surgical Repair in Transformer Attention

Pruning Strategies for Backdoor Defense in LLMs

Lethe: Purifying Backdoored Large Language Models with Knowledge Dilution
