Advances in Multimodal Reasoning and Reward Modeling

The field of multimodal reasoning and reward modeling is rapidly evolving, with a focus on improving the accuracy and explainability of large language models (LLMs) in complex tasks such as visual question answering, math reasoning, and chart reasoning. Recent developments center on reinforcement learning with verifiable rewards (RLVR) and process-level supervision to strengthen the reasoning capabilities of LLMs. Notable advances include new frameworks that integrate RLVR with process-level supervision, such as Answer-Consistent Reinforcement Learning (ACRE) and AutoRubric-R1V, which have achieved state-of-the-art performance on several multimodal reasoning benchmarks. There is also growing emphasis on more reliable, fine-grained evaluation of LLM-generated math proofs and step-level reasoning. Overall, the field is moving toward more robust, interpretable, and generalizable models that can reason effectively and explain their decisions. Noteworthy papers include Answer-Consistent Reinforcement Learning (ACRE), which augments the GRPO algorithm with an auxiliary consistency check to improve answer consistency, and AutoRubric-R1V, a framework that integrates RLVR with process-level supervision through automatically collected rubric-based generative rewards.
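The ACRE-style combination of a correctness reward with an auxiliary answer-consistency check can be illustrated with a minimal sketch. This is a hypothetical simplification, not the paper's implementation: it assumes each sampled rollout exposes the answer extracted from its chain-of-thought trace (`trace_answer`) and its final answer (`final_answer`), grants a smaller bonus when the two agree on a verified-correct answer, and then applies GRPO-style group normalization; the function names, field names, and 0.5 bonus weight are all illustrative assumptions.

```python
def consistency_adjusted_rewards(samples, verify_answer):
    """GRPO-style group advantages with an auxiliary answer-consistency bonus.

    Hypothetical sketch: each sample is a dict with a "trace_answer" (the
    answer extracted from the chain-of-thought) and a "final_answer"; a
    consistency bonus is granted only when the two agree AND the final
    answer passes the verifiable-reward check.
    """
    rewards = []
    for s in samples:
        correct = 1.0 if verify_answer(s["final_answer"]) else 0.0
        consistent = 1.0 if s["trace_answer"] == s["final_answer"] else 0.0
        # Base correctness reward plus a smaller bonus for trace/answer agreement.
        rewards.append(correct + 0.5 * correct * consistent)

    # GRPO-style group normalization: advantage = (r - mean) / std.
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]
```

In this sketch a rollout whose reasoning trace and final answer agree on the correct result outranks one that reaches the correct answer inconsistently, which in turn outranks an incorrect rollout.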

Sources

Answer-Consistent Chain-of-thought Reinforcement Learning For Multi-modal Large Language Models

VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning

OmniQuality-R: Advancing Reward Models Through All-Encompassing Quality Assessment

Chart-RVR: Reinforcement Learning with Verifiable Rewards for Explainable Chart Reasoning

Reasoning as Representation: Rethinking Visual Reinforcement Learning in Image Quality Assessment

From <Answer> to <Think>: Multidimensional Supervision of Reasoning Process for LLM Optimization

Confidence as a Reward: Transforming LLMs into Reward Models

Hard2Verify: A Step-Level Verification Benchmark for Open-Ended Frontier Math

Reliable Fine-Grained Evaluation of Natural Language Math Proofs

AutoRubric-R1V: Rubric-Based Generative Rewards for Faithful Multimodal Reasoning

GroundedPRM: Tree-Guided and Fidelity-Aware Process Reward Modeling for Step-Level Reasoning
