Advances in Interpretable Vision Models and Multimodal Understanding

The fields of computer vision and natural language processing are moving toward more interpretable and reliable models. Recent studies highlight the importance of understanding how models process and integrate local and global features, as well as the need for more accurate and transparent attention mechanisms. There is also growing interest in multimodal models that can effectively comprehend and generate text-image content, with a focus on improving their ability to detect and interpret visual cues. Noteworthy work in this area includes new techniques for document attribution and variational visual question answering, which show promising results in enhancing model interpretability and reliability. In addition, research on multimodal small language models and their application to specialized domains such as remote sensing has demonstrated significant potential for improving performance and efficiency. Notable papers include:

  • Variational Visual Question Answering, which proposes a variational approach to improving the calibration and reliability of multimodal models.
  • MilChat, which introduces a lightweight multimodal small language model for remote sensing applications.
  • Exploring Implicit Visual Misunderstandings in Multimodal Large Language Models, which defines a new attention-based metric for evaluating visual understanding in multimodal models.

Sources

Register and CLS tokens yield a decoupling of local and global features in large ViTs

Do Not Change Me: On Transferring Entities Without Modification in Neural Machine Translation -- a Multilingual Perspective

Attention on Multiword Expressions: A Multilingual Study of BERT-based Models with Regard to Idiomaticity and Microsyntax

Document Attribution: Examining Citation Relationships using Large Language Models

MilChat: Introducing Chain of Thought Reasoning and GRPO to a Multimodal Small Language Model for Remote Sensing

Are We Paying Attention to Her? Investigating Gender Disambiguation and Attention in Machine Translation

Variational Visual Question Answering

Exploring Implicit Visual Misunderstandings in Multimodal Large Language Models through Attention Analysis
