Advances in Multimodal Large Language Models
The field of multimodal large language models is advancing rapidly, with a focus on improving visual reasoning and perception. Recent models integrate visual and textual information to perform complex tasks such as visual question answering, object detection, and image generation. One key direction is models that think visually, using spatio-temporal chain-of-thought reasoning to produce more accurate and informative outputs. Another is controlling the knowledge priors of vision-language models, enabling more flexible and accurate reasoning. Noteworthy papers include 'Pixels Versus Priors: Controlling Knowledge Priors in Vision-Language Models through Visual Counterfacts', which introduces a counterfactual-image dataset and a mechanism for controlling whether model outputs follow visual evidence or memorized priors, and 'Thinking with Generated Images', which enables models to actively construct intermediate visual thoughts and refine them through self-critique.
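To make the pixels-versus-priors idea concrete, the sketch below scores a vision-language model on counterfactual image pairs: a response matching what the edited pixels show indicates reliance on visual evidence, while a response matching world knowledge indicates reliance on priors. This is a minimal, hypothetical probe; the CounterfactExample fields and the query_model callable are illustrative assumptions, not the interface of the paper's dataset or of any particular model.

```python
# Minimal sketch of a pixels-versus-priors probe on counterfactual image pairs.
# Names here (CounterfactExample, query_model) are illustrative assumptions,
# not the API of the Visual CounterFact dataset or of any specific model.
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class CounterfactExample:
    image_path: str          # edited image whose pixels contradict world knowledge
    question: str            # e.g. "What color is this strawberry?"
    prior_answer: str        # answer supported by memorized knowledge ("red")
    counterfact_answer: str  # answer supported by the edited pixels ("blue")


def pixels_vs_priors_rate(
    examples: Iterable[CounterfactExample],
    query_model: Callable[[str, str], str],
) -> dict:
    """Count how often the model follows the visual evidence versus its prior."""
    follows_pixels = follows_prior = other = 0
    for ex in examples:
        answer = query_model(ex.image_path, ex.question).strip().lower()
        if ex.counterfact_answer.lower() in answer:
            follows_pixels += 1
        elif ex.prior_answer.lower() in answer:
            follows_prior += 1
        else:
            other += 1
    total = max(follows_pixels + follows_prior + other, 1)
    return {
        "pixels": follows_pixels / total,
        "priors": follows_prior / total,
        "other": other / total,
    }
```

A real evaluation would swap in an actual model call for query_model and an answer-matching rule appropriate to the dataset.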
Sources
Pixels Versus Priors: Controlling Knowledge Priors in Vision-Language Models through Visual Counterfacts
VRAG-RL: Empower Vision-Perception-Based RAG for Visually Rich Information Understanding via Iterative Reasoning with Reinforcement Learning