Causal Learning and Interpretability in AI

AI research is moving towards a deeper understanding of causal relationships and towards interpretable models. Recent work shows that causal representation learning can uncover latent structure in language models and support causal effect estimation across domains. Prior-data fitted networks have been proposed for in-context causal inference, and Bayesian filtering for emulating complex physical systems such as the climate. In addition, semi-nonnegative matrix factorization has proven effective for decomposing neural network activations into interpretable features. Notable papers in this area include:

  • Preference Learning for AI Alignment: a Causal Perspective, which proposes a causal paradigm for aligning large language models with human values.
  • Foundation Models for Causal Inference via Prior-Data Fitted Networks, which introduces a comprehensive framework for training foundation models in various causal inference settings.
  • Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization, which presents a method for identifying interpretable features in neural networks (a minimal sketch of this kind of decomposition follows the list).

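To make the decomposition idea concrete, here is a minimal, self-contained semi-NMF sketch in NumPy. It is not the paper's exact algorithm: it factors a generic activation matrix A into nonnegative codes H and unconstrained (mixed-sign) feature directions W using the classical multiplicative updates of Ding, Li and Jordan (2010). The function name `semi_nmf`, the toy data, and the rank k = 8 are illustrative assumptions, not details from the paper.

```python
import numpy as np

def semi_nmf(A, k, n_iter=300, eps=1e-9, seed=0):
    """Factor an activation matrix A (n_samples x d_mlp) as A ~= H @ W,
    where the codes H (n_samples x k) are nonnegative and the feature
    directions W (k x d_mlp) are unconstrained in sign.
    Uses the multiplicative updates of Ding, Li & Jordan (2010)."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    H = np.abs(rng.standard_normal((n, k)))  # nonnegative initial codes

    def pos(M):  # elementwise positive part
        return (np.abs(M) + M) / 2.0

    def neg(M):  # elementwise negative part (as a nonnegative matrix)
        return (np.abs(M) - M) / 2.0

    for _ in range(n_iter):
        # W step: unconstrained least squares given the current codes H
        W = np.linalg.solve(H.T @ H + 1e-8 * np.eye(k), H.T @ A)
        # H step: multiplicative update that keeps H >= 0
        AWt = A @ W.T
        WWt = W @ W.T
        H *= np.sqrt((pos(AWt) + H @ neg(WWt)) /
                     (neg(AWt) + H @ pos(WWt) + eps))
    return H, W

# Toy usage: recover 8 parts-based features from synthetic "MLP activations".
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    true_H = np.abs(rng.standard_normal((1000, 8)))
    true_W = rng.standard_normal((8, 512))
    acts = true_H @ true_W + 0.01 * rng.standard_normal((1000, 512))
    H, W = semi_nmf(acts, k=8)
    err = np.linalg.norm(acts - H @ W) / np.linalg.norm(acts)
    print(f"relative reconstruction error: {err:.4f}")
```
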
Sources

Preference Learning for AI Alignment: a Causal Perspective

Do-PFN: In-Context Learning for Causal Effect Estimation

Causal Climate Emulation with Bayesian Filtering

Discovering Hierarchical Latent Capabilities of Language Models via Causal Representation Learning

Foundation Models for Causal Inference via Prior-Data Fitted Networks

Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization
