Spatial Reasoning in Vision-Language Models

The field of vision-language models (VLMs) is moving towards improved spatial reasoning capabilities, with a focus on developing more comprehensive and challenging benchmarks to evaluate these abilities. Recent research has highlighted the limitations of current VLMs in constructing and maintaining 3D scene representations over time from visual signals, as well as their struggles with specific 3D details such as object placement, spatial relationships, and measurements. There is also growing interest in VLMs that can parse navigational signs, extract navigational cues from them, and carry out spatial reasoning tasks that require multi-step visual simulation. Noteworthy papers in this area include OmniSpatial, which introduces a comprehensive benchmark for spatial reasoning grounded in cognitive psychology, covering four major categories and 50 fine-grained subcategories, and GenSpace, which presents a novel benchmark and evaluation pipeline for assessing the spatial awareness of current image generation models and highlights three core limitations in their spatial perception.
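As a concrete illustration of how such benchmarks are typically scored, the sketch below shows a minimal evaluation loop for multiple-choice spatial-reasoning questions with per-category accuracy, in the spirit of benchmarks like OmniSpatial or EOC-Bench. The `query_vlm` function and the `BenchmarkItem` fields are assumptions for illustration only, not the actual data formats or APIs of any of the papers listed below.

```python
# Minimal sketch of a multiple-choice spatial-reasoning evaluation loop.
# NOTE: query_vlm() and the BenchmarkItem fields are hypothetical placeholders,
# not the real interfaces of OmniSpatial, GenSpace, or EOC-Bench.
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class BenchmarkItem:
    image_path: str      # image (or frame sequence) under test
    question: str        # e.g. "Which object is to the left of the red chair?"
    choices: list[str]   # candidate answers
    answer_idx: int      # index of the correct choice
    category: str        # fine-grained subcategory, e.g. "relative position"

def query_vlm(image_path: str, prompt: str) -> str:
    """Placeholder for a real VLM call (API or local model)."""
    raise NotImplementedError("plug in your model here")

def evaluate(items: list[BenchmarkItem]) -> dict[str, float]:
    """Return per-category accuracy over a list of benchmark items."""
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        # Format the question as a lettered multiple-choice prompt (A, B, C, ...).
        prompt = (
            f"{item.question}\n"
            + "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(item.choices))
            + "\nAnswer with a single letter."
        )
        reply = query_vlm(item.image_path, prompt).strip().upper()
        predicted = ord(reply[0]) - 65 if reply and reply[0].isalpha() else -1
        total[item.category] += 1
        correct[item.category] += int(predicted == item.answer_idx)
    return {cat: correct[cat] / total[cat] for cat in total}
```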

Sources

Out of Sight, Not Out of Context? Egocentric Spatial Reasoning in VLMs Across Disjoint Frames

GenSpace: Benchmarking Spatially-Aware Image Generation

Sign Language: Towards Sign Understanding for Robot Autonomy

OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models

Toward Reliable VLM: A Fine-Grained Benchmark and Framework for Exposure, Bias, and Inference in Korean Street Views

Unfolding Spatial Cognition: Evaluating Multimodal Models on Visual Simulations

CIVET: Systematic Evaluation of Understanding in VLMs

EOC-Bench: Can MLLMs Identify, Recall, and Forecast Objects in an Egocentric World?
