Interpretability of Large Language Models

Research on large language models is moving toward improved interpretability, with a focus on understanding how input features interact to shape model outputs. Recent work highlights the importance of accounting for hierarchical feature interactions, self-repair mechanisms inside the network, and the input-output functionality of individual components. Methods that exploit sparsity in feature interactions, modify gradient-based attributions to account for self-repair, and build statistical model-agnostic local explanations have shown promise in producing more faithful accounts of model behavior. In parallel, self-critique and refinement frameworks are being explored to improve the faithfulness of the natural language explanations these models generate about their own answers.

Noteworthy papers include ProxySPEX, which discovers sparse feature interactions with far fewer model inferences; GIM, which improves attribution quality by accounting for self-repair during backpropagation; SMILE, a statistical model-agnostic approach that explains how a large language model's response depends on different parts of the prompt; and SR-NLE, which iteratively critiques and refines natural language explanations to make them more faithful.
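To make the model-agnostic, prompt-level explanation idea concrete, the sketch below shows a generic LIME-style perturbation procedure: segments of a prompt are randomly masked, the change in the model's response is scored, and a weighted linear surrogate assigns each segment an importance. This is an illustrative assumption of how such explanations can work in general, not the exact algorithm from the SMILE paper; the names `query_llm`, `response_similarity`, and `explain_prompt`, the echo stub, and the token-overlap similarity are all hypothetical stand-ins.

```python
# Hedged sketch of a perturbation-based, model-agnostic prompt explanation
# (LIME-style surrogate). All helper names and metrics are illustrative
# assumptions, not the SMILE implementation.
import random
import numpy as np


def query_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (assumption for this sketch)."""
    return prompt  # echo stub so the example runs end to end


def response_similarity(a: str, b: str) -> float:
    """Crude token-overlap similarity standing in for an embedding metric."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / max(len(ta | tb), 1)


def explain_prompt(prompt: str, n_samples: int = 200, seed: int = 0) -> dict:
    """Score each prompt segment by its effect on the model's response."""
    rng = random.Random(seed)
    segments = prompt.split()          # treat each word as one "feature"
    base_response = query_llm(prompt)

    masks, scores = [], []
    for _ in range(n_samples):
        # Randomly keep or drop each segment, then query the perturbed prompt.
        mask = [rng.random() < 0.5 for _ in segments]
        perturbed = " ".join(s for s, keep in zip(segments, mask) if keep)
        masks.append([float(m) for m in mask])
        scores.append(response_similarity(base_response, query_llm(perturbed)))

    # Weighted least-squares surrogate: samples whose masks are closer to the
    # original (all segments kept) get higher weight, as in LIME-style kernels.
    X = np.array(masks)
    y = np.array(scores)
    w = np.exp(-np.sum(1.0 - X, axis=1) / max(len(segments), 1))
    Xw = X * w[:, None]
    coef, *_ = np.linalg.lstsq(Xw.T @ X, Xw.T @ y, rcond=None)
    return dict(zip(segments, coef.round(3)))


if __name__ == "__main__":
    print(explain_prompt("Explain why the sky appears blue at noon"))
```

The surrogate coefficients give a per-segment attribution: segments whose removal most degrades the response similarity receive the largest weights. Swapping the echo stub for a real model call and the token-overlap score for an embedding-based similarity would bring the sketch closer to practical use.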

Sources

ProxySPEX: Inference-Efficient Interpretability via Sparse Feature Interactions in LLMs

GIM: Improved Interpretability for Large Language Models

Understanding Gated Neurons in Transformers from Their Input-Output Functionality

Explainability of Large Language Models using SMILE: Statistical Model-agnostic Interpretability with Local Explanations

Self-Critique and Refinement for Faithful Natural Language Explanations
