Interpretability and Explainability in AI Models

The field of AI is moving toward more interpretable and explainable models. Recent research focuses on discovering unknown concepts and on understanding decision-making in large language models and image classifiers. Techniques such as sparse autoencoders and concept-based contrastive explanations are being explored to extract human-understandable features and to provide insight into model behavior, and neuro-symbolic learning frameworks are being developed that transform decision tree ensembles into sparse, partially connected neural networks. Notable papers include:

  • Unveiling Decision-Making in LLMs for Text Classification, which presents a sparse autoencoder (SAE) based architecture for text classification and evaluates how well it extracts influential, interpretable concepts (a minimal SAE sketch follows this list).
  • Toward Simple and Robust Contrastive Explanations for Image Classification, which implements concept-based contrastive explanations for image classification by leveraging instance similarity and concept relevance.
  • BranchNet: A Neuro-Symbolic Learning Framework for Structured Multi-Class Classification, which transforms decision tree ensembles into sparse, partially connected neural networks (see the illustrative sketch below).
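
To make the sparse-autoencoder idea concrete, the sketch below trains a minimal SAE on a batch of hidden activations and reads off sparse, non-negative feature activations. Dimensions, hyperparameters, and the random stand-in batch are illustrative assumptions, not details taken from the papers.

    # Minimal sparse autoencoder (SAE) sketch over LLM hidden activations.
    # All names and sizes are illustrative assumptions, not from the papers.
    import torch
    import torch.nn as nn

    class SparseAutoencoder(nn.Module):
        def __init__(self, d_model: int, d_features: int):
            super().__init__()
            self.encoder = nn.Linear(d_model, d_features)   # activation -> feature space
            self.decoder = nn.Linear(d_features, d_model)   # feature space -> reconstruction

        def forward(self, h: torch.Tensor):
            f = torch.relu(self.encoder(h))                 # non-negative feature activations
            h_hat = self.decoder(f)                         # reconstructed activation
            return h_hat, f

    d_model, d_features = 768, 4096                         # overcomplete feature dictionary
    sae = SparseAutoencoder(d_model, d_features)
    opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
    l1_coeff = 1e-3                                         # sparsity pressure on features

    h = torch.randn(64, d_model)                            # stand-in batch of hidden states
    opt.zero_grad()
    h_hat, f = sae(h)
    loss = nn.functional.mse_loss(h_hat, h) + l1_coeff * f.abs().mean()
    loss.backward()
    opt.step()

After training on real activations, the rows of the decoder (or the most strongly activating inputs per feature) are inspected to attach human-readable concepts to individual features.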

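The BranchNet-style mapping can be sketched in a similar spirit. The illustrative code below is not the authors' exact construction: it extracts root-to-leaf branches from a small random forest, wires each branch to a hidden unit connected only to the features that branch tests (a sparse, partially connected layer), and seeds the output layer from the leaf class distributions.

    # Illustrative tree-ensemble-to-sparse-network sketch (assumed construction,
    # not the published BranchNet algorithm).
    import numpy as np
    import torch
    import torch.nn as nn
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_iris(return_X_y=True)
    forest = RandomForestClassifier(n_estimators=5, max_depth=3, random_state=0).fit(X, y)

    branches = []   # (feature indices tested on the branch, leaf class distribution)
    for est in forest.estimators_:
        t = est.tree_
        def walk(node, feats):
            if t.children_left[node] == -1:                  # leaf node
                branches.append((sorted({int(f) for f in feats}), t.value[node][0]))
                return
            f = t.feature[node]
            walk(t.children_left[node], feats + [f])
            walk(t.children_right[node], feats + [f])
        walk(0, [])

    n_features, n_classes = X.shape[1], len(np.unique(y))
    mask = torch.zeros(len(branches), n_features)            # sparse connectivity pattern
    out_init = torch.zeros(n_classes, len(branches))
    for j, (feats, dist) in enumerate(branches):
        mask[j, feats] = 1.0
        out_init[:, j] = torch.tensor(dist / dist.sum(), dtype=torch.float32)

    class BranchNetSketch(nn.Module):
        def __init__(self):
            super().__init__()
            self.w1 = nn.Parameter(torch.randn(len(branches), n_features) * 0.1)
            self.b1 = nn.Parameter(torch.zeros(len(branches)))
            self.out = nn.Linear(len(branches), n_classes)
            with torch.no_grad():
                self.out.weight.copy_(out_init)              # seed from leaf distributions

        def forward(self, x):
            h = torch.tanh(x @ (self.w1 * mask).T + self.b1) # mask enforces partial connectivity
            return self.out(h)

    model = BranchNetSketch()
    logits = model(torch.tensor(X, dtype=torch.float32))
    print(logits.shape)                                      # torch.Size([150, 3])

The key property preserved from the ensemble is the connectivity: each hidden unit only sees the features its originating branch tests, so the learned network stays sparse and its units remain traceable to symbolic decision paths.
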
Sources

Use Sparse Autoencoders to Discover Unknown Concepts, Not to Act on Known Concepts

Unveiling Decision-Making in LLMs for Text Classification: Extraction of influential and interpretable concepts with Sparse Autoencoders

Toward Simple and Robust Contrastive Explanations for Image Classification by Leveraging Instance Similarity and Concept Relevance

BranchNet: A Neuro-Symbolic Learning Framework for Structured Multi-Class Classification
