Research on spatial intelligence and 3D reasoning is advancing rapidly, with a focus on models that can understand and manipulate spatial relationships. Recent work has highlighted the limitations of current large multimodal models in this area and has proposed remedies such as incorporating 3D-informed data and architectural designs. Benchmark datasets and evaluation frameworks are also being developed to assess the spatial intelligence of large vision-language models. Together, these advances stand to improve machine performance on tasks that demand complex spatial reasoning.
Several papers are particularly noteworthy. SpatialLLM introduces a compound 3D-informed design towards spatially-intelligent large multimodal models and surpasses GPT-4o performance by 8.7%. Beyond Recognition investigates whether vision language models can perform visual perspective taking and reveals a gap between surface-level object recognition and deeper spatial and perspective reasoning. SITE introduces a benchmark dataset for Spatial Intelligence Thorough Evaluation and demonstrates a positive correlation between a model's spatial reasoning proficiency and its performance on an embodied AI task.