Advances in Visual Grounded Reasoning

The field of visual grounded reasoning is moving toward more sophisticated models that can effectively process high-resolution images and demonstrate robust grounding. Recent research has focused on end-to-end reinforcement learning frameworks that enable large multimodal models to iteratively focus on key visual regions, improving performance on visual question answering tasks. There is also growing interest in evaluating the multimodal cognition of these models, assessing not only answer accuracy but also the quality of step-by-step reasoning and its alignment with relevant visual evidence. Noteworthy papers include:

  • High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning, which proposes a multi-turn grounding-based reinforcement learning framework that achieves state-of-the-art results on several benchmarks.
  • MagiC: Evaluating Multimodal Cognition Toward Grounded Visual Reasoning, which introduces a comprehensive benchmark to evaluate grounded multimodal cognition and provides insightful analyses of current approaches.
  • Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology, which proposes a diagnostic benchmark and a training paradigm to supervise localization and reasoning jointly with reinforcement learning, leading to significant improvements on several benchmarks.

Sources

High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning

Enhancing Scientific Visual Question Answering through Multimodal Reasoning and Ensemble Modeling

MagiC: Evaluating Multimodal Cognition Toward Grounded Visual Reasoning

Entity Re-identification in Visual Storytelling via Contrastive Reinforcement Learning

Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology
