Advances in Multimodal Reasoning and Perception

The field of multimodal large language models (MLLMs) is advancing rapidly, with a strong focus on improving reasoning and perception capabilities. Recent work highlights the importance of fine-grained visual perception, and several benchmarks and datasets have been introduced to evaluate MLLMs in this area: VisuRiddles targets abstract visual reasoning and identifies fine-grained perception as a primary bottleneck for MLLMs, HueManity probes nuanced perceptual tasks, and Do You See Me examines visual perception errors. Another noteworthy paper, SemVink, proposes a multimodal evaluation model for bidirectional generation between image and text. There is also growing interest in applying MLLMs to real-world applications such as disaster damage assessment and Humanities and Social Sciences tasks, with benchmarks like HSSBench and MMRB introduced to evaluate performance in these domains.
Sources
Fire360: A Benchmark for Robust Perception and Episodic Memory in Degraded 360-Degree Firefighting Videos
VisuRiddles: Fine-grained Perception is a Primary Bottleneck for Multimodal Large Language Models in Abstract Visual Reasoning