Advances in Neural Network Interpretability

The field of neural network interpretability is moving towards a deeper understanding of the internal mechanisms of complex models. Recent work disentangles decodability from causality, showing that a feature being linearly readable from a representation does not imply that the model actually uses it (and vice versa), and treating the two as complementary lenses on what a network represents. There is also growing interest in physics-based perspectives on language transformers, which borrow formalisms from quantum mechanics to describe the generation process, and in tensor-based methods for dataset characterization, which aim to make the structure of training data more interpretable and actionable.

Noteworthy papers include CAST, which analyzes transformer layer functions through spectral tracking and reveals distinct behaviors in encoder-only versus decoder-only models; QLENS, which proposes a quantum perspective on language transformers, translating tools from quantum mechanics to natural language processing; and Circuit Insights, which introduces WeightLens and CircuitLens, two methods that go beyond purely activation-based analysis to make mechanistic circuit analysis more robust and scalable.
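
To make the decodability/causality distinction concrete, the following is a minimal, self-contained sketch, not drawn from any of the papers listed below: on a toy two-layer linear network (weights and the "feature" are illustrative assumptions), a linear probe shows that a feature is perfectly decodable from one hidden unit, while activation patching shows that the same unit has no causal effect on the model's output.

```python
# Minimal sketch (not from any cited paper): a feature can be linearly
# decodable from a hidden layer without being causally used downstream.
import numpy as np

rng = np.random.default_rng(0)

# Toy "model": hidden = x @ W1; output = hidden @ w2.
# Hidden unit 0 copies input feature x[:, 0] but is ignored by the output
# head (decodable, not causal); hidden unit 1 drives the output (causal).
W1 = np.array([[1.0, 0.0],
               [0.0, 1.0],
               [0.0, 0.0]])  # 3 inputs -> 2 hidden units (assumed weights)
w2 = np.array([0.0, 2.0])    # output reads only hidden unit 1

def forward(x, unit0_patch=None):
    hidden = x @ W1
    if unit0_patch is not None:        # activation patching on hidden unit 0
        hidden = hidden.copy()
        hidden[:, 0] = unit0_patch
    return hidden, hidden @ w2

# Synthetic inputs; the feature of interest is x[:, 0] (e.g., an object count).
X = rng.normal(size=(500, 3))
feature = X[:, 0]
H, y = forward(X)

# 1) Decodability: fit a linear probe from hidden activations to the feature.
probe, *_ = np.linalg.lstsq(H, feature, rcond=None)
r2 = 1.0 - np.var(feature - H @ probe) / np.var(feature)
print(f"probe R^2 (decodability): {r2:.3f}")        # ~1.000: fully decodable

# 2) Causality: overwrite hidden unit 0 with shuffled values and measure the
#    effect on the model output.
_, y_patched = forward(X, unit0_patch=rng.permutation(H[:, 0]))
print(f"mean |output change| under patching: {np.abs(y - y_patched).mean():.3f}")
# 0.000: the perfectly decodable unit has no causal effect on the output.
```

The same contrast scales up in practice: probes measure what information is present in activations, whereas interventions such as patching measure what information the downstream computation actually depends on.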

Sources

Causality $\neq$ Decodability, and Vice Versa: Lessons from Interpreting Counting ViTs

QLENS: Towards A Quantum Perspective of Language Transformers

Data Understanding Survey: Pursuing Improved Dataset Characterization Via Tensor-based Methods

CAST: Compositional Analysis via Spectral Tracking for Understanding Transformer Layer Functions

On the Identifiability of Tensor Ranks via Prior Predictive Matching

Circuit Insights: Towards Interpretability Beyond Activations
