Global Visual Perception in Large Vision-Language Models

The field of large vision-language models (LVLMs) is moving toward a deeper understanding of global visual perception: the ability to perceive and reason over image-wide visual structure rather than relying on local shortcuts. Recent work has exposed the limitations of current models in this regard, with even the most capable models failing to perform better than random chance on certain global perception tasks. This has motivated new benchmarks and evaluation methods that rigorously assess the global visual perception capabilities of LVLMs. Noteworthy papers include TopoPerception, which introduces a shortcut-free benchmark for evaluating global visual perception, and Visual Room 2.0, which proposes a hierarchical benchmark for evaluating perception-cognition alignment in multi-modal large language models.
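
To make the "no better than random chance" comparison concrete, the sketch below shows one way a multiple-choice benchmark evaluation might score a model against its chance baseline. It is a minimal illustration only: the `predict_answer` function, the benchmark item fields, and the four-option format are assumptions for exposition, not the actual interfaces of TopoPerception or Visual Room 2.0.

```python
import random

def predict_answer(image_path: str, question: str, options: list[str]) -> str:
    """Placeholder for an LVLM call; here it simply guesses at random."""
    return random.choice(options)

def evaluate(benchmark: list[dict]) -> None:
    # Score each multiple-choice item and compare accuracy to the chance baseline.
    correct = 0
    for item in benchmark:
        pred = predict_answer(item["image"], item["question"], item["options"])
        correct += int(pred == item["answer"])
    accuracy = correct / len(benchmark)
    chance = 1.0 / len(benchmark[0]["options"])  # e.g. 0.25 for four options
    print(f"accuracy={accuracy:.3f}  chance baseline={chance:.3f}")
    if accuracy <= chance:
        print("Model performs no better than random chance on this task.")

# Toy usage with a single hypothetical item.
benchmark = [{
    "image": "img_0001.png",
    "question": "Which region is enclosed by the curve?",
    "options": ["A", "B", "C", "D"],
    "answer": "B",
}]
evaluate(benchmark)
```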

Sources

TopoPerception: A Shortcut-Free Evaluation of Global Visual Perception in Large Vision-Language Models

Visual Room 2.0: Seeing is Not Understanding for MLLMs

Task Addition and Weight Disentanglement in Closed-Vocabulary Models

GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization

T2T-VICL: Unlocking the Boundaries of Cross-Task Visual In-Context Learning via Implicit Text-Driven VLMs
