The field of multimodal language models is shifting its focus from enhancing reasoning to evaluating and improving perception. Recent studies highlight how far current models fall short of human-like perception and reasoning: despite advances in raw visual acuity, they lack core capabilities such as nonlocal visual reasoning and struggle with simple perceptual tests. In response, researchers are introducing new benchmarks and evaluation methods, such as the Turing Eye Test, that probe perception directly. These studies also suggest that grounding language in visual input can help models infer structured world representations, and that fine-tuning the vision tower can enable rapid adaptation to new tasks (a minimal sketch of this setup follows the list below). Noteworthy papers in this area include:
- VLMs have Tunnel Vision: Evaluating Nonlocal Visual Reasoning in Leading VLMs, which presents a comprehensive evaluation of vision-language models' capacity for nonlocal visual reasoning.
- Pixels, Patterns, but No Poetry: To See The World like Humans, which introduces the Turing Eye Test, a challenging perception-oriented benchmark for Multimodal Large Language Models.
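
To make the vision-tower fine-tuning point concrete, the sketch below freezes every parameter of a toy vision-language model except its image encoder and builds an optimizer over only those unfrozen weights. The `ToyVLM` class, the `vision_tower` attribute name, and all hyperparameters are illustrative placeholders, not the setup used in the cited papers; real VLM implementations expose their image encoders under different names and sizes.

```python
import torch
from torch import nn


class ToyVLM(nn.Module):
    """Minimal stand-in for a VLM: a 'vision tower' feeding a language head.
    Names and dimensions are illustrative only."""

    def __init__(self, d: int = 32):
        super().__init__()
        self.vision_tower = nn.Sequential(nn.Linear(3 * 16 * 16, d), nn.GELU())
        self.language_head = nn.Linear(d, 10)

    def forward(self, pixels: torch.Tensor) -> torch.Tensor:
        return self.language_head(self.vision_tower(pixels.flatten(1)))


def vision_tower_optimizer(model: nn.Module, lr: float = 1e-5) -> torch.optim.Optimizer:
    # Freeze everything, then unfreeze only the vision encoder so that
    # adaptation to a new perceptual task happens in the visual pathway
    # rather than the language backbone.
    for p in model.parameters():
        p.requires_grad = False
    for p in model.vision_tower.parameters():
        p.requires_grad = True
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)


if __name__ == "__main__":
    model = ToyVLM()
    opt = vision_tower_optimizer(model)

    # One dummy adaptation step on random "images" and labels.
    pixels = torch.randn(4, 3, 16, 16)
    targets = torch.randint(0, 10, (4,))
    loss = nn.functional.cross_entropy(model(pixels), targets)
    loss.backward()
    opt.step()

    n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"trainable params (vision tower only): {n_trainable}")
```

The design choice being illustrated is simply that restricting updates to the vision encoder keeps the language backbone intact while letting the perceptual front end adapt; how well this works on a given benchmark is an empirical question for the papers above.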