The field of multimodal large language models (MLLMs) is advancing rapidly, with a particular focus on improving reasoning and planning capabilities. Recent research has highlighted the limitations of current MLLM benchmarks, which often rely on heuristic task groupings and lack clear cognitive targets. To address this, new evaluation frameworks and benchmarks have been proposed, including ones that apply structural equation modeling or draw on cognitive science. These developments aim to provide more interpretable and theoretically grounded evaluations of MLLM abilities.
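To make the structural-equation-modeling idea concrete, the sketch below fits a two-factor measurement model to synthetic per-task benchmark scores. The task names, the factor structure, and the use of the semopy library are illustrative assumptions for this sketch, not details taken from the cited works.

```python
"""Minimal sketch of an SEM-style benchmark analysis (assumptions: task names,
factor structure, and the semopy library are illustrative, not from the papers)."""
import numpy as np
import pandas as pd
from semopy import Model

rng = np.random.default_rng(0)
n_models = 200  # hypothetical number of evaluated MLLMs

# Simulate two correlated latent abilities and six observed task scores.
perception = rng.normal(size=n_models)
reasoning = 0.5 * perception + rng.normal(scale=0.8, size=n_models)

def task(loading, factor):
    """Observed task score = factor loading * latent ability + noise."""
    return loading * factor + rng.normal(scale=0.5, size=n_models)

data = pd.DataFrame({
    "chart_reading":   task(0.9, perception),
    "ocr_accuracy":    task(0.8, perception),
    "scene_grounding": task(0.7, perception),
    "spatial_puzzle":  task(0.9, reasoning),
    "multi_step_plan": task(0.8, reasoning),
    "math_diagram":    task(0.7, reasoning),
})

# Measurement model: two latent abilities, each loading on three tasks,
# with a freely estimated covariance between them (lavaan-style syntax).
desc = """
perception =~ chart_reading + ocr_accuracy + scene_grounding
reasoning  =~ spatial_puzzle + multi_step_plan + math_diagram
perception ~~ reasoning
"""

model = Model(desc)
model.fit(data)
print(model.inspect())  # factor loadings and the perception-reasoning covariance
```

In this style of analysis, the estimated loadings and factor covariances, rather than heuristic task groupings, indicate which tasks measure a shared underlying ability.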
Noteworthy papers include:
- MARBLE, a challenging multimodal reasoning benchmark that scrutinizes MLLMs' ability to reason step-by-step through complex multimodal problems, and
- MMReason, a new benchmark designed to precisely and comprehensively evaluate the long-chain reasoning capabilities of MLLMs with diverse, open-ended, and challenging questions.