Global Visual Perception in Large Vision-Language Models

The field of large vision-language models (LVLMs) is moving toward a deeper understanding of global visual perception: the ability to perceive and reason over image-wide visual structure rather than relying on local shortcuts. Recent work has exposed the limitations of current models in this regard, with even the most capable models failing to perform better than random chance on certain global perception tasks. This has motivated new benchmarks and evaluation methods that rigorously assess the global visual perception capabilities of LVLMs. Noteworthy papers include TopoPerception, which introduces a shortcut-free benchmark for evaluating global visual perception, and Visual Room 2.0, which proposes a hierarchical benchmark for evaluating perception-cognition alignment in multi-modal large language models.
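
To make the "no better than random chance" comparison concrete, the sketch below shows one way a multiple-choice benchmark evaluation might score a model against its chance baseline. It is a minimal illustration only: the `predict_answer` function, the benchmark item fields, and the four-option format are assumptions for exposition, not the actual interfaces of TopoPerception or Visual Room 2.0.

```python
import random

def predict_answer(image_path: str, question: str, options: list[str]) -> str:
    """Placeholder for an LVLM call; here it simply guesses at random."""
    return random.choice(options)

def evaluate(benchmark: list[dict]) -> None:
    # Score each multiple-choice item and compare accuracy to the chance baseline.
    correct = 0
    for item in benchmark:
        pred = predict_answer(item["image"], item["question"], item["options"])
        correct += int(pred == item["answer"])
    accuracy = correct / len(benchmark)
    chance = 1.0 / len(benchmark[0]["options"])  # e.g. 0.25 for four options
    print(f"accuracy={accuracy:.3f}  chance baseline={chance:.3f}")
    if accuracy <= chance:
        print("Model performs no better than random chance on this task.")

# Toy usage with a single hypothetical item.
benchmark = [{
    "image": "img_0001.png",
    "question": "Which region is enclosed by the curve?",
    "options": ["A", "B", "C", "D"],
    "answer": "B",
}]
evaluate(benchmark)
```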

Sources

TopoPerception: A Shortcut-Free Evaluation of Global Visual Perception in Large Vision-Language Models

Visual Room 2.0: Seeing is Not Understanding for MLLMs

Task Addition and Weight Disentanglement in Closed-Vocabulary Models

GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization

T2T-VICL: Unlocking the Boundaries of Cross-Task Visual In-Context Learning via Implicit Text-Driven VLMs
