Advances in 3D Scene Generation and Spatial Reasoning

The field of 3D scene generation and spatial reasoning is advancing rapidly, with a focus on developing more realistic and interactive environments. Recent research has explored large language models and multimodal learning to improve both scene generation and scene understanding. One notable direction is frameworks that generate realistic 3D scenes by exploiting the underlying structure of rooms and learning how object poses vary in real-world scenes. Another is benchmarks and datasets that evaluate the spatial reasoning abilities of vision-language models on tasks such as object relation reasoning and geolocation.

Noteworthy papers in this area include:

- From Programs to Poses proposes a framework for generating realistic 3D scenes by leveraging the underlying structure of rooms and learning the variation of object poses from real-world scenes.
- Real2USD demonstrates the effectiveness of the Universal Scene Description (USD) language for representing geometric, photometric, and semantic information about the environment in LLM-based robotics tasks (see the USD sketch after this list).
- Where on Earth? presents a comprehensive benchmark for evaluating the geolocation skills of vision-language models across different scales and settings.
- Situat3DChange introduces a large-scale dataset for situated 3D change understanding and proposes an efficient 3D MLLM approach to point cloud comparison.
- Prompt-Guided Spatial Understanding enhances spatial comprehension by embedding mask dimensions into input prompts, achieving state-of-the-art results on the Physical AI Spatial Intelligence Warehouse dataset (see the prompt-construction sketch below).
- IL3D presents a large-scale indoor layout dataset for LLM-driven 3D scene generation, together with rigorous benchmarks for evaluating it.
- Spatial-DISE proposes a unified benchmark for evaluating spatial reasoning in vision-language models and develops a scalable pipeline that generates diverse, verifiable spatial reasoning questions (see the question-generation sketch below).
- Reasoning in Space introduces the Grounded-Spatial Reasoner, which explores spatial representations that bridge the gap between 3D visual grounding and spatial reasoning.
- QuASH addresses the challenge of querying visual-language robotic maps using natural-language heuristics (see the map-query sketch below).
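
Real2USD's core representational choice can be illustrated with the official pxr Python bindings for USD. The following is a minimal sketch, assuming a hypothetical scene file and a made-up "semantic:label" attribute; it is not the paper's actual schema, only a demonstration that geometry and semantics can live on the same prim.

```python
from pxr import Usd, UsdGeom, Sdf, Gf

# Create a new USD stage; the file name is illustrative.
stage = Usd.Stage.CreateNew("kitchen_scene.usda")

# Geometric information: a cube prim standing in for a detected table.
table = UsdGeom.Cube.Define(stage, "/World/Table")
UsdGeom.XformCommonAPI(table.GetPrim()).SetTranslate(Gf.Vec3d(1.0, 0.0, 0.4))

# Semantic information: a custom string attribute an LLM planner could read.
# The attribute name "semantic:label" is an assumption, not Real2USD's schema.
label = table.GetPrim().CreateAttribute("semantic:label", Sdf.ValueTypeNames.String)
label.Set("table")

stage.GetRootLayer().Save()
```

Since .usda files are plain text, a stage like this can be serialized and handed to an LLM directly, which is presumably part of USD's appeal for language-driven robotics.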
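
Prompt-Guided Spatial Understanding's central idea, embedding mask dimensions into the input prompt, is easy to sketch. The prompt template below is an assumption for illustration; the paper's exact format is not reproduced here.

```python
import numpy as np

def mask_to_prompt(mask: np.ndarray, label: str) -> str:
    """Fold a binary segmentation mask's extent into a text fragment for a
    VLM prompt. The "<label: WxH px, center (x, y)>" format is hypothetical."""
    ys, xs = np.nonzero(mask)
    w = int(xs.max() - xs.min() + 1)
    h = int(ys.max() - ys.min() + 1)
    cx, cy = int(xs.mean()), int(ys.mean())
    return f"<{label}: {w}x{h} px, center ({cx}, {cy})>"

# Hypothetical usage: a mask covering a pallet in a 480x640 warehouse image.
mask = np.zeros((480, 640), dtype=bool)
mask[100:220, 300:500] = True
print("How far is the pallet from the shelf?", mask_to_prompt(mask, "pallet"))
```

Giving the model explicit pixel extents in text spares it from inferring object size from visual tokens alone, which is a plausible source of gains on fine-grained relation reasoning.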
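
Spatial-DISE's pipeline hinges on questions being verifiable: when questions are generated from ground-truth relations, every answer is known by construction. A minimal template-based sketch, with hypothetical relations standing in for ones extracted from scene geometry:

```python
import random

# Hypothetical ground-truth relations; a real pipeline would extract these
# from scene geometry, which is what makes the questions verifiable.
RELATIONS = [("mug", "to the left of", "laptop"), ("chair", "behind", "desk")]
ALL_RELS = ["to the left of", "to the right of", "behind", "in front of"]

def make_question(subj: str, rel: str, obj: str) -> tuple[str, str]:
    """Emit a yes/no spatial question whose answer is known by construction."""
    if random.random() < 0.5:
        return f"Is the {subj} {rel} the {obj}?", "yes"
    # Note: a real pipeline must check the distractor against *all* true
    # relations in the scene, or the "no" label may be wrong.
    distractor = random.choice([r for r in ALL_RELS if r != rel])
    return f"Is the {subj} {distractor} the {obj}?", "no"

for s, r, o in RELATIONS:
    print(make_question(s, r, o))
```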
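
QuASH's specific heuristics are not spelled out in this digest, but querying a visual-language map generally reduces to scoring per-cell embeddings against the embedding of the query string. A sketch under that assumption, with random features standing in for real CLIP-style embeddings:

```python
import numpy as np

def query_map(cell_features: np.ndarray, text_embedding: np.ndarray) -> np.ndarray:
    """Score every map cell against a language query by cosine similarity.

    cell_features: (H, W, D) per-cell visual-language embeddings.
    text_embedding: (D,) embedding of the natural-language query.
    Returns an (H, W) similarity heatmap."""
    cells = cell_features / np.linalg.norm(cell_features, axis=-1, keepdims=True)
    text = text_embedding / np.linalg.norm(text_embedding)
    return cells @ text

# Hypothetical usage: locate the cell that best matches a query embedding.
rng = np.random.default_rng(0)
heat = query_map(rng.normal(size=(32, 32, 512)), rng.normal(size=512))
iy, ix = np.unravel_index(np.argmax(heat), heat.shape)
print(f"Best-matching cell: ({iy}, {ix})")
```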

Sources

From Programs to Poses: Factored Real-World Scene Generation via Learned Program Libraries

Real2USD: Scene Representations in Universal Scene Description Language

Where on Earth? A Vision-Language Benchmark for Probing Model Geolocation Skills Across Scales

Situat3DChange: Situated 3D Change Understanding Dataset for Multimodal Large Language Model

Prompt-Guided Spatial Understanding with RGB-D Transformers for Fine-Grained Object Relation Reasoning

IL3D: A Large-Scale Indoor Layout Dataset for LLM-Driven 3D Scene Generation

Spatial-DISE: A Unified Benchmark for Evaluating Spatial Reasoning in Vision-Language Models

Reasoning in Space via Grounding in the World

QuASH: Using Natural-Language Heuristics to Query Visual-Language Robotic Maps
