Interpretable Neural Networks through Sparse Autoencoders and Logic-Based Models

The field of interpretable neural networks is converging on techniques that expose the decision-making processes of complex models. Recent research has focused on sparse autoencoders (SAEs) and logic-based models, both of which have shown promise in uncovering human-interpretable features and representations. SAEs have been improved through orthogonality constraints, binary sparse coding, and new variants such as AbsTopK, which enables the discovery of bidirectional features. Logic-based models such as the Tsetlin Machine have demonstrated performance competitive with neural networks while remaining interpretable. These advances have the potential to increase transparency and trust in neural network models, particularly in applications where high performance alone is not sufficient for a proposed solution to be adopted.

Noteworthy papers include OrtSAE, which introduces orthogonality constraints to mitigate feature absorption and composition, and AbsTopK, which enables the discovery of bidirectional features. In addition, the Tsetlin Machine has shown promise for transparent logic-based classification, with a proposed methodology for generating local interpretations and global class representations. A minimal code sketch of the TopK versus AbsTopK distinction, and of an orthogonality-style regularizer, follows below.
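The sketch below illustrates, under stated assumptions, the difference between a standard TopK sparse autoencoder (non-negative codes, top-k largest activations kept) and an AbsTopK-style variant (top-k largest magnitudes kept with sign preserved, allowing bidirectional features), plus an orthogonality-style penalty on decoder directions in the spirit of OrtSAE. The `SparseAutoencoder` class, the dimensions, and the exact penalty formulation are illustrative assumptions, not the papers' reference implementations.

```python
# Hedged sketch: TopK vs AbsTopK encodings and an orthogonality-style penalty.
# Class names, dimensions, and the penalty form are assumptions for illustration.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, k: int, bidirectional: bool = False):
        super().__init__()
        self.W_enc = nn.Linear(d_model, d_hidden)
        self.W_dec = nn.Linear(d_hidden, d_model)
        self.k = k
        self.bidirectional = bidirectional  # False -> TopK, True -> AbsTopK-style

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        pre = self.W_enc(x)
        if self.bidirectional:
            # AbsTopK-style: keep the k entries with largest magnitude,
            # preserving sign, so a feature can fire in either direction.
            scores = pre.abs()
        else:
            # TopK: keep the k largest non-negative activations.
            pre = torch.relu(pre)
            scores = pre
        topk = scores.topk(self.k, dim=-1)
        mask = torch.zeros_like(pre).scatter_(-1, topk.indices, 1.0)
        return pre * mask

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.W_dec(self.encode(x))


def orthogonality_penalty(W_dec: nn.Linear) -> torch.Tensor:
    # Orthogonality-style regularizer (assumed form): penalize pairwise cosine
    # similarity between decoder feature directions to discourage feature
    # absorption and composition.
    D = W_dec.weight  # shape (d_model, d_hidden); columns are feature directions
    D = D / (D.norm(dim=0, keepdim=True) + 1e-8)
    gram = D.T @ D
    off_diag = gram - torch.eye(gram.shape[0], device=gram.device)
    return off_diag.pow(2).mean()


if __name__ == "__main__":
    x = torch.randn(8, 512)  # a batch of model activations (illustrative size)
    sae = SparseAutoencoder(d_model=512, d_hidden=4096, k=32, bidirectional=True)
    recon = sae(x)
    loss = (recon - x).pow(2).mean() + 0.1 * orthogonality_penalty(sae.W_dec)
    print(recon.shape, loss.item(), (sae.encode(x) != 0).sum(dim=-1))
```

In this sketch the only change needed to move from TopK to AbsTopK-style encoding is the scoring rule used to select active features; the sign-preserving selection is what permits bidirectional feature activations.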

Sources

OrtSAE: Orthogonal Sparse Autoencoders Uncover Atomic Features

(Sometimes) Less is More: Mitigating the Complexity of Rule-based Representation for Interpretable Classification

Measuring Sparse Autoencoder Feature Sensitivity

Binary Sparse Coding for Interpretability

AbsTopK: Rethinking Sparse Autoencoders For Bidirectional Features

GPT and Prejudice: A Sparse Approach to Understanding Learned Representations in Large Language Models

A Methodology for Transparent Logic-Based Classification Using a Multi-Task Convolutional Tsetlin Machine
