Interpretable Neural Networks through Sparse Autoencoders and Logic-Based Models

The field of interpretable neural networks is converging on techniques that expose the decision-making processes of complex models. Recent research has focused on sparse autoencoders (SAEs) and logic-based models, both of which have shown promise in uncovering human-interpretable features and representations. SAEs have been improved through orthogonality constraints, binary sparse coding, and new variants such as AbsTopK, which enables the discovery of bidirectional features. Logic-based models such as the Tsetlin Machine have demonstrated performance competitive with neural networks while remaining interpretable. These advances have the potential to increase transparency and trust in neural network models, particularly in applications where high performance alone is not sufficient for a proposed solution to be adopted.

Noteworthy papers include OrtSAE, which introduces orthogonality constraints to mitigate feature absorption and composition, and AbsTopK, which enables the discovery of bidirectional features. In addition, the Tsetlin Machine has shown promise for transparent logic-based classification, with a proposed methodology for generating local interpretations and global class representations. A minimal code sketch of the TopK versus AbsTopK distinction, and of an orthogonality-style regularizer, follows below.
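The sketch below illustrates, under stated assumptions, the difference between a standard TopK sparse autoencoder (non-negative codes, top-k largest activations kept) and an AbsTopK-style variant (top-k largest magnitudes kept with sign preserved, allowing bidirectional features), plus an orthogonality-style penalty on decoder directions in the spirit of OrtSAE. The `SparseAutoencoder` class, the dimensions, and the exact penalty formulation are illustrative assumptions, not the papers' reference implementations.

```python
# Hedged sketch: TopK vs AbsTopK encodings and an orthogonality-style penalty.
# Class names, dimensions, and the penalty form are assumptions for illustration.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, k: int, bidirectional: bool = False):
        super().__init__()
        self.W_enc = nn.Linear(d_model, d_hidden)
        self.W_dec = nn.Linear(d_hidden, d_model)
        self.k = k
        self.bidirectional = bidirectional  # False -> TopK, True -> AbsTopK-style

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        pre = self.W_enc(x)
        if self.bidirectional:
            # AbsTopK-style: keep the k entries with largest magnitude,
            # preserving sign, so a feature can fire in either direction.
            scores = pre.abs()
        else:
            # TopK: keep the k largest non-negative activations.
            pre = torch.relu(pre)
            scores = pre
        topk = scores.topk(self.k, dim=-1)
        mask = torch.zeros_like(pre).scatter_(-1, topk.indices, 1.0)
        return pre * mask

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.W_dec(self.encode(x))


def orthogonality_penalty(W_dec: nn.Linear) -> torch.Tensor:
    # Orthogonality-style regularizer (assumed form): penalize pairwise cosine
    # similarity between decoder feature directions to discourage feature
    # absorption and composition.
    D = W_dec.weight  # shape (d_model, d_hidden); columns are feature directions
    D = D / (D.norm(dim=0, keepdim=True) + 1e-8)
    gram = D.T @ D
    off_diag = gram - torch.eye(gram.shape[0], device=gram.device)
    return off_diag.pow(2).mean()


if __name__ == "__main__":
    x = torch.randn(8, 512)  # a batch of model activations (illustrative size)
    sae = SparseAutoencoder(d_model=512, d_hidden=4096, k=32, bidirectional=True)
    recon = sae(x)
    loss = (recon - x).pow(2).mean() + 0.1 * orthogonality_penalty(sae.W_dec)
    print(recon.shape, loss.item(), (sae.encode(x) != 0).sum(dim=-1))
```

In this sketch the only change needed to move from TopK to AbsTopK-style encoding is the scoring rule used to select active features; the sign-preserving selection is what permits bidirectional feature activations.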

Sources

OrtSAE: Orthogonal Sparse Autoencoders Uncover Atomic Features

(Sometimes) Less is More: Mitigating the Complexity of Rule-based Representation for Interpretable Classification

Measuring Sparse Autoencoder Feature Sensitivity

Binary Sparse Coding for Interpretability

AbsTopK: Rethinking Sparse Autoencoders For Bidirectional Features

GPT and Prejudice: A Sparse Approach to Understanding Learned Representations in Large Language Models

A Methodology for Transparent Logic-Based Classification Using a Multi-Task Convolutional Tsetlin Machine
