Interpretability and Explainability in AI Models

The field of AI is moving toward more interpretable and explainable models. Recent research focuses on discovering unknown concepts and on understanding decision-making in large language models and image classifiers. Techniques such as sparse autoencoders and concept-based contrastive explanations are being explored to extract human-understandable features and to provide insight into model behavior, and neuro-symbolic learning frameworks are being developed that transform decision tree ensembles into sparse, partially connected neural networks. Notable papers include:

  • Unveiling Decision-Making in LLMs for Text Classification, which presents a sparse autoencoder (SAE) based architecture for text classification and evaluates how well it extracts influential, interpretable concepts (a minimal SAE sketch follows this list).
  • Toward Simple and Robust Contrastive Explanations for Image Classification, which implements concept-based contrastive explanations for image classification by leveraging instance similarity and concept relevance.
  • BranchNet: A Neuro-Symbolic Learning Framework for Structured Multi-Class Classification, which transforms decision tree ensembles into sparse, partially connected neural networks (see the illustrative sketch below).
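
To make the sparse-autoencoder idea concrete, the sketch below trains a minimal SAE on a batch of hidden activations and reads off sparse, non-negative feature activations. Dimensions, hyperparameters, and the random stand-in batch are illustrative assumptions, not details taken from the papers.

    # Minimal sparse autoencoder (SAE) sketch over LLM hidden activations.
    # All names and sizes are illustrative assumptions, not from the papers.
    import torch
    import torch.nn as nn

    class SparseAutoencoder(nn.Module):
        def __init__(self, d_model: int, d_features: int):
            super().__init__()
            self.encoder = nn.Linear(d_model, d_features)   # activation -> feature space
            self.decoder = nn.Linear(d_features, d_model)   # feature space -> reconstruction

        def forward(self, h: torch.Tensor):
            f = torch.relu(self.encoder(h))                 # non-negative feature activations
            h_hat = self.decoder(f)                         # reconstructed activation
            return h_hat, f

    d_model, d_features = 768, 4096                         # overcomplete feature dictionary
    sae = SparseAutoencoder(d_model, d_features)
    opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
    l1_coeff = 1e-3                                         # sparsity pressure on features

    h = torch.randn(64, d_model)                            # stand-in batch of hidden states
    opt.zero_grad()
    h_hat, f = sae(h)
    loss = nn.functional.mse_loss(h_hat, h) + l1_coeff * f.abs().mean()
    loss.backward()
    opt.step()

After training on real activations, the rows of the decoder (or the most strongly activating inputs per feature) are inspected to attach human-readable concepts to individual features.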

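The BranchNet-style mapping can be sketched in a similar spirit. The illustrative code below is not the authors' exact construction: it extracts root-to-leaf branches from a small random forest, wires each branch to a hidden unit connected only to the features that branch tests (a sparse, partially connected layer), and seeds the output layer from the leaf class distributions.

    # Illustrative tree-ensemble-to-sparse-network sketch (assumed construction,
    # not the published BranchNet algorithm).
    import numpy as np
    import torch
    import torch.nn as nn
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_iris(return_X_y=True)
    forest = RandomForestClassifier(n_estimators=5, max_depth=3, random_state=0).fit(X, y)

    branches = []   # (feature indices tested on the branch, leaf class distribution)
    for est in forest.estimators_:
        t = est.tree_
        def walk(node, feats):
            if t.children_left[node] == -1:                  # leaf node
                branches.append((sorted({int(f) for f in feats}), t.value[node][0]))
                return
            f = t.feature[node]
            walk(t.children_left[node], feats + [f])
            walk(t.children_right[node], feats + [f])
        walk(0, [])

    n_features, n_classes = X.shape[1], len(np.unique(y))
    mask = torch.zeros(len(branches), n_features)            # sparse connectivity pattern
    out_init = torch.zeros(n_classes, len(branches))
    for j, (feats, dist) in enumerate(branches):
        mask[j, feats] = 1.0
        out_init[:, j] = torch.tensor(dist / dist.sum(), dtype=torch.float32)

    class BranchNetSketch(nn.Module):
        def __init__(self):
            super().__init__()
            self.w1 = nn.Parameter(torch.randn(len(branches), n_features) * 0.1)
            self.b1 = nn.Parameter(torch.zeros(len(branches)))
            self.out = nn.Linear(len(branches), n_classes)
            with torch.no_grad():
                self.out.weight.copy_(out_init)              # seed from leaf distributions

        def forward(self, x):
            h = torch.tanh(x @ (self.w1 * mask).T + self.b1) # mask enforces partial connectivity
            return self.out(h)

    model = BranchNetSketch()
    logits = model(torch.tensor(X, dtype=torch.float32))
    print(logits.shape)                                      # torch.Size([150, 3])

The key property preserved from the ensemble is the connectivity: each hidden unit only sees the features its originating branch tests, so the learned network stays sparse and its units remain traceable to symbolic decision paths.
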
Sources

Use Sparse Autoencoders to Discover Unknown Concepts, Not to Act on Known Concepts

Unveiling Decision-Making in LLMs for Text Classification: Extraction of influential and interpretable concepts with Sparse Autoencoders

Toward Simple and Robust Contrastive Explanations for Image Classification by Leveraging Instance Similarity and Concept Relevance

BranchNet: A Neuro-Symbolic Learning Framework for Structured Multi-Class Classification
