Advancements in Spatial Intelligence and 3D Scene Understanding

The field of spatial intelligence and 3D scene understanding is advancing rapidly, with a focus on models that can perceive and act in the real world from natural language instructions. Recent work introduces new frameworks and benchmarks for embodied spatial intelligence, omnidirectional spatial modeling, and 3D semantic occupancy representation, improving the performance of multimodal large language models (MLLMs) on spatial reasoning, visual question answering, and trajectory planning. Fully quantized multi-agent systems and reconstructive geometry instruction tuning frameworks have made MLLMs more efficient and scalable in real-world deployments, while surveys on panoramic vision examine how perspective methods transfer to omnidirectional images and identify open challenges and future directions in data, models, and applications.

Noteworthy papers in this area include:

Beyond Pixels proposes a cross-modal alignment method built on geometric-semantic world priors to improve generalization in unseen scenes.

Text-to-Layout presents a generative workflow for drafting architectural floor plans with large language models.

Embodied Spatial Intelligence introduces a framework for robots that perceive and act in the real world from natural language instructions.

Omnidirectional Spatial Modeling from Correlated Panoramas introduces a benchmark dataset for visual question answering over cross-frame correlated panoramas.

Reg3D proposes a reconstructive geometry instruction tuning framework for 3D scene understanding.

QuantV2X introduces a fully quantized multi-agent system for cooperative perception.

OccVLA proposes a vision-language-action model with implicit 3D occupancy supervision.

Spatial Reasoning with Vision-Language Models in Ego-Centric Multi-View Scenes introduces a benchmark for evaluating the spatial reasoning abilities of vision-language models.

Semantic Causality-Aware Vision-Based 3D Occupancy Prediction proposes a causal loss for holistic, end-to-end supervision of the modular 2D-to-3D transformation pipeline; a minimal illustrative sketch of voxel-level occupancy supervision follows this list.
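Several of the papers above revolve around predicting a dense 3D semantic occupancy grid and supervising it per voxel. The sketch below illustrates that general idea only; it is not the method of any specific paper listed here, and the grid shape, class count, and function names are assumptions chosen purely for illustration.

```python
# Illustrative sketch only: a minimal dense semantic-occupancy grid and a
# per-voxel cross-entropy loss, the kind of supervision commonly used for
# vision-based 3D occupancy prediction. All names and sizes are hypothetical.
import torch
import torch.nn.functional as F

NUM_CLASSES = 18          # e.g. 17 semantic classes plus one "free space" class
GRID = (200, 200, 16)     # voxels along x, y, z of an ego-centric volume

def occupancy_loss(pred_logits: torch.Tensor, gt_labels: torch.Tensor) -> torch.Tensor:
    """Per-voxel classification loss.

    pred_logits: (B, NUM_CLASSES, X, Y, Z) network output.
    gt_labels:   (B, X, Y, Z) integer class index per voxel; 255 marks unlabeled voxels.
    """
    return F.cross_entropy(pred_logits, gt_labels, ignore_index=255)

# Toy usage with random tensors standing in for a real model and dataset.
pred = torch.randn(1, NUM_CLASSES, *GRID, requires_grad=True)
gt = torch.randint(0, NUM_CLASSES, (1, *GRID))
loss = occupancy_loss(pred, gt)
loss.backward()
print(float(loss))
```

In practice the predicted logits would come from a 2D-to-3D lifting backbone rather than random tensors, and methods differ mainly in how that lifting is built and supervised, not in this basic per-voxel objective.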

Sources

Beyond Pixels: Introducing Geometric-Semantic World Priors for Video-based Embodied Models via Spatio-temporal Alignment

Text-to-Layout: A Generative Workflow for Drafting Architectural Floor Plans Using LLMs

Embodied Spatial Intelligence: from Implicit Scene Modeling to Spatial Reasoning

Omnidirectional Spatial Modeling from Correlated Panoramas

Understanding Space Is Rocket Science - Only Top Reasoning Models Can Solve Spatial Understanding Tasks

Why Do MLLMs Struggle with Spatial Understanding? A Systematic Analysis from Data to Architecture

Reg3D: Reconstructive Geometry Instruction Tuning for 3D Scene Understanding

QuantV2X: A Fully Quantized Multi-Agent System for Cooperative Perception

Vehicle-to-Infrastructure Collaborative Spatial Perception via Multimodal Large Language Models

SliceSemOcc: Vertical Slice Based Multimodal 3D Semantic Occupancy Representation

One Flight Over the Gap: A Survey from Perspective to Panoramic Vision

OccVLA: Vision-Language-Action Model with Implicit 3D Occupancy Supervision

Spatial Reasoning with Vision-Language Models in Ego-Centric Multi-View Scenes

Semantic Causality-Aware Vision-Based 3D Occupancy Prediction
