The field of visually grounded reasoning is moving toward models that can process high-resolution images and ground their answers robustly in visual evidence. Recent work proposes end-to-end reinforcement learning frameworks that let large multi-modal models iteratively focus on key visual regions, improving performance on visual question answering (a minimal sketch of such a multi-turn grounding loop appears after the paper list below). There is also growing interest in evaluating the multimodal cognition of these models, assessing not only answer accuracy but also the quality of step-by-step reasoning and its alignment with the relevant visual evidence. Noteworthy papers include:
- High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning, which proposes such a multi-turn grounding framework and reports state-of-the-art results on several benchmarks.
- MagiC: Evaluating Multimodal Cognition Toward Grounded Visual Reasoning, which introduces a comprehensive benchmark for grounded multimodal cognition and analyzes how current approaches perform on it.
- Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology, which proposes a diagnostic benchmark and a training paradigm to supervise localization and reasoning jointly with reinforcement learning, leading to significant improvements on several benchmarks.
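
To make the multi-turn grounding idea concrete, the following is a minimal Python sketch of the inference-time loop: at each turn the model either emits a region to zoom into or a final answer, and each chosen region is cropped and fed back as the next view. This is an illustrative sketch, not the method of any paper listed above; the names (`StubGroundingModel`, `Turn`, `multi_turn_grounding`) are hypothetical, and a toy stub stands in for the actual multi-modal model.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical types and model interface; not the API of any cited paper.

@dataclass
class Turn:
    box: Optional[Tuple[int, int, int, int]]  # (x0, y0, x1, y1) region to zoom into, or None
    answer: Optional[str]                     # final answer, or None if still grounding

class StubGroundingModel:
    """Toy stand-in for a multi-modal model: grounds once, then answers."""
    def step(self, image, question: str, history: List[Turn]) -> Turn:
        if not history:                       # turn 1: pick a central region to inspect
            h, w = len(image), len(image[0])
            return Turn(box=(w // 4, h // 4, 3 * w // 4, 3 * h // 4), answer=None)
        return Turn(box=None, answer="a red traffic light")  # turn 2: answer

def crop(image, box):
    """Cut out the model-chosen region from a row-major image."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]

def multi_turn_grounding(model, image, question: str, max_turns: int = 3):
    """Iteratively zoom into model-chosen regions until an answer is produced."""
    history: List[Turn] = []
    view = image
    for _ in range(max_turns):
        turn = model.step(view, question, history)
        history.append(turn)
        if turn.answer is not None:
            return turn.answer, history
        view = crop(view, turn.box)           # next turn sees only the grounded region

    return None, history                      # turn budget exhausted without an answer

image = [[0] * 64 for _ in range(64)]         # dummy 64x64 single-channel image
answer, trace = multi_turn_grounding(StubGroundingModel(), image, "What color is the light?")
print(answer, len(trace))                     # -> "a red traffic light 2"
```

In the reinforcement learning framings described above, a reward would score the final answer (and possibly the chosen regions), and the grounding policy would be trained end to end; the sketch only shows the rollout structure that such training would optimize.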