Causal Learning and Interpretability in AI

AI research is moving towards a deeper understanding of causal relationships and towards interpretable models. Recent work shows that causal representation learning can uncover latent structure in language models and support causal effect estimation across domains. Prior-data fitted networks have been proposed for in-context causal inference, and Bayesian filtering for emulating complex physical systems such as the climate. In addition, semi-nonnegative matrix factorization has proven effective for decomposing neural network activations into interpretable features. Notable papers in this area include:

  • Preference Learning for AI Alignment: a Causal Perspective, which proposes a causal paradigm for aligning large language models with human values.
  • Foundation Models for Causal Inference via Prior-Data Fitted Networks, which introduces a comprehensive framework for training foundation models in various causal inference settings.
  • Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization, which presents a method for identifying interpretable features in neural networks (a minimal sketch of this kind of decomposition follows the list).

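To make the decomposition idea concrete, here is a minimal, self-contained semi-NMF sketch in NumPy. It is not the paper's exact algorithm: it factors a generic activation matrix A into nonnegative codes H and unconstrained (mixed-sign) feature directions W using the classical multiplicative updates of Ding, Li and Jordan (2010). The function name `semi_nmf`, the toy data, and the rank k = 8 are illustrative assumptions, not details from the paper.

```python
import numpy as np

def semi_nmf(A, k, n_iter=300, eps=1e-9, seed=0):
    """Factor an activation matrix A (n_samples x d_mlp) as A ~= H @ W,
    where the codes H (n_samples x k) are nonnegative and the feature
    directions W (k x d_mlp) are unconstrained in sign.
    Uses the multiplicative updates of Ding, Li & Jordan (2010)."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    H = np.abs(rng.standard_normal((n, k)))  # nonnegative initial codes

    def pos(M):  # elementwise positive part
        return (np.abs(M) + M) / 2.0

    def neg(M):  # elementwise negative part (as a nonnegative matrix)
        return (np.abs(M) - M) / 2.0

    for _ in range(n_iter):
        # W step: unconstrained least squares given the current codes H
        W = np.linalg.solve(H.T @ H + 1e-8 * np.eye(k), H.T @ A)
        # H step: multiplicative update that keeps H >= 0
        AWt = A @ W.T
        WWt = W @ W.T
        H *= np.sqrt((pos(AWt) + H @ neg(WWt)) /
                     (neg(AWt) + H @ pos(WWt) + eps))
    return H, W

# Toy usage: recover 8 parts-based features from synthetic "MLP activations".
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    true_H = np.abs(rng.standard_normal((1000, 8)))
    true_W = rng.standard_normal((8, 512))
    acts = true_H @ true_W + 0.01 * rng.standard_normal((1000, 512))
    H, W = semi_nmf(acts, k=8)
    err = np.linalg.norm(acts - H @ W) / np.linalg.norm(acts)
    print(f"relative reconstruction error: {err:.4f}")
```
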
Sources

Preference Learning for AI Alignment: a Causal Perspective

Do-PFN: In-Context Learning for Causal Effect Estimation

Causal Climate Emulation with Bayesian Filtering

Discovering Hierarchical Latent Capabilities of Language Models via Causal Representation Learning

Foundation Models for Causal Inference via Prior-Data Fitted Networks

Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization
