The field of multimodal reasoning is advancing rapidly, with a focus on improving models' ability to understand and reason about complex visual and textual information. Recent developments have highlighted the importance of perception-grounded multimodal reasoning, in which models explicitly ground their reasoning in visual and textual evidence. This has led to new frameworks and methods that combine reinforcement learning, uncertainty estimation, and visual perception to strengthen multimodal reasoning. Notably, reinforcement learning with verifiable rewards has shown promise for improving reasoning in large language models, while incorporating visual uncertainty has enabled models to better explore and understand visual inputs.
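To make the "verifiable rewards" idea concrete, the sketch below shows a minimal binary reward that checks a sampled response against a known ground-truth answer, the kind of signal such RL pipelines optimize. The \boxed{...} answer convention and the function names are illustrative assumptions, not drawn from any specific paper.

```python
import re

# Assumed convention: the model wraps its final answer in \boxed{...}.
def extract_answer(response: str):
    """Pull the final answer out of a model response, if present."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    return match.group(1).strip() if match else None

def verifiable_reward(response: str, ground_truth: str) -> float:
    """Binary verifiable reward: 1.0 when the extracted answer matches the reference."""
    answer = extract_answer(response)
    return 1.0 if answer is not None and answer == ground_truth else 0.0

# Example: score a group of sampled responses, as a GRPO-style trainer might,
# before using the rewards to weight policy-gradient updates.
responses = [r"The sum is \boxed{42}", r"I believe it is \boxed{41}"]
rewards = [verifiable_reward(r, "42") for r in responses]
print(rewards)  # [1.0, 0.0]
```

Because the reward is computed by a deterministic checker rather than a learned model, it is hard to game, which is what makes it "verifiable."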
Noteworthy papers in this area include MIRG-RL, a unified framework for multi-image reasoning and grounding with reinforcement learning that achieves state-of-the-art performance on multi-image grounding benchmarks; CapPO, a novel RL framework that explicitly enforces perceptual consistency during policy optimization and delivers competitive performance on math-focused and general reasoning benchmarks; and VAPO, a simple yet effective method that explicitly steers the reasoning process toward visually grounded trajectories and sets new state-of-the-art results across a wide range of established benchmarks.
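The summaries above do not spell out how perceptual consistency is enforced; the following is a hypothetical sketch of one way a consistency term could be blended into an RL reward during policy optimization. The token-overlap consistency proxy, the `alpha` weighting, and all names are assumptions for illustration, not CapPO's or VAPO's actual mechanisms.

```python
# Hypothetical: blend a verifiable task reward with a perceptual-consistency proxy.
def consistency(response: str, reference_caption: str) -> float:
    """Crude consistency proxy: fraction of reference-caption tokens echoed in the response."""
    ref_tokens = set(reference_caption.lower().split())
    resp_tokens = set(response.lower().split())
    return len(ref_tokens & resp_tokens) / max(len(ref_tokens), 1)

def shaped_reward(task_reward: float, response: str,
                  reference_caption: str, alpha: float = 0.5) -> float:
    """Weighted sum of the task reward and the consistency proxy (alpha is assumed)."""
    return (1 - alpha) * task_reward + alpha * consistency(response, reference_caption)

print(shaped_reward(
    1.0,
    "A red cube sits left of a blue sphere, so the answer is 3",
    "a red cube next to a blue sphere",
))
```

A shaping term like this rewards trajectories whose intermediate reasoning stays anchored to the visual evidence, rather than rewarding the final answer alone.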