Advances in Interpretable Multimodal Models

The field of multimodal models is moving toward more interpretable and robust architectures. Recent developments have focused on improving the faithfulness of attribution methods, which is essential for understanding the decision-making process of these models. Researchers are exploring new approaches that address the limitations of methods such as submodular subset selection, and are proposing novel algorithms that enable efficient object-level interpretation. There is also growing interest in using attribution methods to guide training and to improve generalization on out-of-distribution data. Noteworthy papers in this area include PhaseWin Search Framework, which proposes a novel phase-window search algorithm for efficient object-level interpretation; Did Models Sufficient Learn?, which introduces Subset-Selected Counterfactual Augmentation to improve model generalization; and Concept Regions Matter, which presents a new cluster-importance approach for benchmarking contrastive vision-language models.
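To make the subset-selection idea mentioned above concrete, the following is a minimal sketch, assuming a generic greedy formulation: a small set of image regions is chosen to maximize a monotone submodular faithfulness score (for example, the model's confidence when only the selected regions are visible). The function name `greedy_submodular_attribution`, the `score_fn` interface, and the toy score are illustrative assumptions, not the exact procedure of any of the cited papers.

```python
import numpy as np

def greedy_submodular_attribution(regions, score_fn, k):
    """Greedily pick k regions whose union maximizes a (sub)modular score.

    regions  : list of region identifiers (e.g. patch indices)
    score_fn : callable mapping a frozenset of regions to a scalar score;
               assumed monotone submodular (e.g. model confidence on the
               partially masked input plus a diversity term)
    k        : number of regions to keep as the attribution set
    """
    selected = set()
    for _ in range(k):
        best_gain, best_region = -np.inf, None
        for r in regions:
            if r in selected:
                continue
            # Marginal gain of adding region r to the current selection.
            gain = score_fn(frozenset(selected | {r})) - score_fn(frozenset(selected))
            if gain > best_gain:
                best_gain, best_region = gain, r
        if best_region is None:
            break
        selected.add(best_region)
    return selected

# Toy usage: a concave function of summed per-region importances gives a
# monotone submodular score with diminishing returns.
if __name__ == "__main__":
    importances = {0: 0.9, 1: 0.1, 2: 0.6, 3: 0.3}
    score = lambda s: sum(importances[r] for r in s) ** 0.5
    print(greedy_submodular_attribution(list(importances), score, k=2))
```

For monotone submodular scores, this greedy loop carries the classic (1 - 1/e) approximation guarantee, which is one reason subset-selection-style attribution can remain tractable even over many candidate regions.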

Sources

PhaseWin Search Framework Enable Efficient Object-Level Interpretation

Did Models Sufficient Learn? Attribution-Guided Training via Subset-Selected Counterfactual Augmentation

Semantic Prioritization in Visual Counterfactual Explanations with Weighted Segmentation and Auto-Adaptive Region Selection

Concept Regions Matter: Benchmarking CLIP with a New Cluster-Importance Approach

CAMS: Towards Compositional Zero-Shot Learning via Gated Cross-Attention and Multi-Space Disentanglement
