Advances in Multimodal Reasoning and Interpretability

Multimodal research is converging on stronger reasoning in large language models (LLMs), vision-language models (VLMs), and multimodal large language models (MLLMs) across domains such as visual question answering, 3D scene understanding, and psychological analysis. Researchers are improving the interpretability and transparency of these models through intermediate representations, attention analysis, and new evaluation benchmarks. Datasets such as PartNeXt and SCENECOT-185K are driving progress in fine-grained, hierarchical 3D part understanding and in grounded chain-of-thought reasoning, while studies of the limits of active reasoning in MLLMs and of the provable importance of gradients for language-assisted image clustering are shedding light on these models' underlying mechanisms. Potential applications span robotics, computer vision, and human-computer interaction.

Noteworthy papers include: COGS, a data-efficient framework for equipping MLLMs with advanced reasoning abilities; SceneCOT, a framework for eliciting grounded chain-of-thought reasoning in 3D scenes; and Speculative Verdict, a training-free framework for information-intensive visual reasoning via speculation.
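The speculation-then-verdict idea lends itself to a short sketch: several small draft VLMs each propose a reasoning path over the image, and a single stronger model reads all of them and synthesizes a final answer, all at inference time. The outline below is a minimal, hypothetical Python illustration of that pattern; the function names, the `ReasoningPath` container, and the prompt format are assumptions for exposition, not the paper's actual implementation or API.

```python
# Minimal sketch of a speculation-then-verdict pipeline in the spirit of
# Speculative Verdict. All names here are hypothetical placeholders.
from dataclasses import dataclass
from typing import Callable


@dataclass
class ReasoningPath:
    model_name: str
    rationale: str  # step-by-step reasoning grounded in the image
    answer: str


def speculative_verdict(
    image: bytes,
    question: str,
    drafters: list[Callable[[bytes, str], ReasoningPath]],
    verdict_model: Callable[[str], str],
) -> str:
    """Small draft VLMs each propose a reasoning path; a stronger verdict
    model reads every path and synthesizes the final answer. No training
    is involved -- the whole pipeline runs at inference time."""
    # 1. Speculation: collect candidate reasoning paths from cheap drafters.
    paths = [draft(image, question) for draft in drafters]

    # 2. Expose every draft rationale in one prompt so the verdict model
    #    can cross-check visual details the drafts disagree on.
    prompt = f"Question: {question}\n\n"
    for i, p in enumerate(paths, 1):
        prompt += (
            f"Draft {i} ({p.model_name}):\n"
            f"{p.rationale}\nAnswer: {p.answer}\n\n"
        )
    prompt += "Weigh the drafts, resolve conflicts, and give a final answer."

    # 3. Verdict: a single call to the strong model yields the answer.
    return verdict_model(prompt)
```

The design choice worth noting is that the expensive model is invoked exactly once per question, regardless of how many drafters run, which is what makes this kind of framework cheap and training-free.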

Sources

Composition-Grounded Instruction Synthesis for Visual Reasoning

When Seeing Is not Enough: Revealing the Limits of Active Reasoning in MLLMs

Towards more holistic interpretability: A lightweight disentangled Concept Bottleneck Model

Cognitive Load Traces as Symbolic and Visual Accounts of Deep Model Cognition

On the Provable Importance of Gradients for Language-Assisted Image Clustering

Structured Interfaces for Automated Reasoning with 3D Scene Graphs

See or Say Graphs: Agent-Driven Scalable Graph Understanding with Vision-Language Models

Seeing but Not Believing: Probing the Disconnect Between Visual Attention and Answer Correctness in VLMs

SceneCOT: Eliciting Grounded Chain-of-Thought Reasoning in 3D Scenes

See, Think, Act: Online Shopper Behavior Simulation with VLM Agents

Reasoning Like Experts: Leveraging Multimodal Large Language Models for Drawing-based Psychoanalysis

[De|Re]constructing VLMs' Reasoning in Counting

I Spy With My Model's Eye: Visual Search as a Behavioural Test for MLLMs

PartNeXt: A Next-Generation Dataset for Fine-Grained and Hierarchical 3D Part Understanding

Diagnosing Visual Reasoning: Challenges, Insights, and a Path Forward

Small Drafts, Big Verdict: Information-Intensive Visual Reasoning via Speculation
