Emerging Trends in 3D Vision and Spatial Reasoning

The field of 3D vision and spatial reasoning is advancing rapidly, driven by innovations in deep learning and computer vision. Recent work has focused on improving models' ability to understand and reason about 3D space, with applications in robotics, autonomous driving, and virtual reality. One key direction is the development of models that learn to think in space and time, enabling them to understand and navigate complex environments more effectively. Another important trend is multimodal learning, in which models are trained on multiple data sources, such as images, videos, and text, to improve their spatial reasoning and broader world understanding.

Notable papers in this area include SPIDER, which introduces a universal feature-matching framework for robust calibration, and C3Po, which presents a new dataset and model for cross-view, cross-modality correspondence. The Disc3D pipeline has shown promising results in automatically curating high-quality 3D dialog data, while LAST demonstrates the effectiveness of learning to think in space and time for generalist vision-language models. MapFormer introduces a self-supervised approach to learning cognitive maps with input-dependent positional embeddings, and Ref-SAM3D extends SAM3D with text prompts for reference 3D reconstruction. Other noteworthy papers include V$^2$-SAM, LocateAnything3D, and G$^2$VLM, each of which makes a significant contribution to 3D vision and spatial reasoning.

Sources

SPIDER: Spatial Image CorresponDence Estimator for Robust Calibration

C3Po: Cross-View Cross-Modality Correspondence by Pointmap Prediction

Disc3D: Automatic Curation of High-Quality 3D Dialog Data via Discriminative Object Referring

LAST: LeArning to Think in Space and Time for Generalist Vision-Language Models

MapFormer: Self-Supervised Learning of Cognitive Maps with Input-Dependent Positional Embeddings

Ref-SAM3D: Bridging SAM3D with Text for Reference 3D Reconstruction

3D Motion Perception of Binocular Vision Target with PID-CNN

DINO-Tok: Adapting DINO for Visual Tokenizers

Vision-Language Memory for Spatial Reasoning

LocateAnything3D: Vision-Language 3D Detection with Chain-of-Sight

V$^2$-SAM: Marrying SAM2 with Multi-Prompt Experts for Cross-View Object Correspondence

Scenes as Tokens: Multi-Scale Normal Distributions Transform Tokenizer for General 3D Vision-Language Understanding

E-M3RF: An Equivariant Multimodal 3D Re-assembly Framework

G$^2$VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning
