Interpretability of Large Language Models

Research on large language models is moving toward improved interpretability, with a focus on understanding how input features interact to shape model outputs. Recent work highlights the importance of accounting for hierarchical feature interactions, self-repair mechanisms inside the network, and the input-output functionality of individual components. Methods that exploit sparsity in feature interactions, modify gradient-based attributions to account for self-repair, and build statistical model-agnostic local explanations have shown promise in producing more faithful accounts of model behavior. In parallel, self-critique and refinement frameworks are being explored to improve the faithfulness of the natural language explanations these models generate about their own answers.

Noteworthy papers include ProxySPEX, which discovers sparse feature interactions with far fewer model inferences; GIM, which improves attribution quality by accounting for self-repair during backpropagation; SMILE, a statistical model-agnostic approach that explains how a large language model's response depends on different parts of the prompt; and SR-NLE, which iteratively critiques and refines natural language explanations to make them more faithful.
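To make the model-agnostic, prompt-level explanation idea concrete, the sketch below shows a generic LIME-style perturbation procedure: segments of a prompt are randomly masked, the change in the model's response is scored, and a weighted linear surrogate assigns each segment an importance. This is an illustrative assumption of how such explanations can work in general, not the exact algorithm from the SMILE paper; the names `query_llm`, `response_similarity`, and `explain_prompt`, the echo stub, and the token-overlap similarity are all hypothetical stand-ins.

```python
# Hedged sketch of a perturbation-based, model-agnostic prompt explanation
# (LIME-style surrogate). All helper names and metrics are illustrative
# assumptions, not the SMILE implementation.
import random
import numpy as np


def query_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (assumption for this sketch)."""
    return prompt  # echo stub so the example runs end to end


def response_similarity(a: str, b: str) -> float:
    """Crude token-overlap similarity standing in for an embedding metric."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / max(len(ta | tb), 1)


def explain_prompt(prompt: str, n_samples: int = 200, seed: int = 0) -> dict:
    """Score each prompt segment by its effect on the model's response."""
    rng = random.Random(seed)
    segments = prompt.split()          # treat each word as one "feature"
    base_response = query_llm(prompt)

    masks, scores = [], []
    for _ in range(n_samples):
        # Randomly keep or drop each segment, then query the perturbed prompt.
        mask = [rng.random() < 0.5 for _ in segments]
        perturbed = " ".join(s for s, keep in zip(segments, mask) if keep)
        masks.append([float(m) for m in mask])
        scores.append(response_similarity(base_response, query_llm(perturbed)))

    # Weighted least-squares surrogate: samples whose masks are closer to the
    # original (all segments kept) get higher weight, as in LIME-style kernels.
    X = np.array(masks)
    y = np.array(scores)
    w = np.exp(-np.sum(1.0 - X, axis=1) / max(len(segments), 1))
    Xw = X * w[:, None]
    coef, *_ = np.linalg.lstsq(Xw.T @ X, Xw.T @ y, rcond=None)
    return dict(zip(segments, coef.round(3)))


if __name__ == "__main__":
    print(explain_prompt("Explain why the sky appears blue at noon"))
```

The surrogate coefficients give a per-segment attribution: segments whose removal most degrades the response similarity receive the largest weights. Swapping the echo stub for a real model call and the token-overlap score for an embedding-based similarity would bring the sketch closer to practical use.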

Sources

ProxySPEX: Inference-Efficient Interpretability via Sparse Feature Interactions in LLMs

GIM: Improved Interpretability for Large Language Models

Understanding Gated Neurons in Transformers from Their Input-Output Functionality

Explainability of Large Language Models using SMILE: Statistical Model-agnostic Interpretability with Local Explanations

Self-Critique and Refinement for Faithful Natural Language Explanations
