Advances in 3D Spatial Understanding with Large Language Models

The field of 3D spatial understanding is advancing rapidly through the integration of large language models (LLMs). Recent work shows that LLMs can be fine-tuned to reach state-of-the-art results on a range of 3D understanding tasks, including human activity recognition, spatial reasoning, and object affordance grounding. Notably, researchers have proposed methods that leverage point clouds, vision-language models, and counterfactual reasoning to strengthen 3D spatial understanding. These advances carry significant implications for real-world applications such as autonomous driving, robotics, and virtual reality.

Some noteworthy papers in this area include:

  • Exploring the Capabilities of LLMs for IMU-based Fine-grained Human Activity Understanding, which achieved a 129x improvement in fine-grained human activity recognition using LLMs.
  • NuScenes-SpatialQA, a benchmark for evaluating the spatial understanding and reasoning capabilities of vision-language models in autonomous driving.
  • OmniDrive, a holistic vision-language dataset for autonomous driving with counterfactual reasoning, which demonstrated significant improvements in decision-making and planning.
  • The Point, the Vision and the Text, which comprehensively evaluated the role of point clouds in 3D spatial reasoning and found that LLMs without point input could achieve competitive performance.
  • IAAO, a novel framework for interactive affordance learning for articulated objects in 3D environments, which enabled robust affordance-based interaction and manipulation of objects.
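To make the first item above concrete, here is a minimal sketch of how raw IMU readings might be serialized into a text prompt for an LLM-based activity classifier. This is an illustrative assumption, not the pipeline from the paper: the function name `format_imu_prompt`, the sample window, and the label set are all hypothetical.

```python
# Hypothetical sketch: serializing a window of tri-axial accelerometer
# samples into a natural-language prompt for an LLM activity classifier.
# Names and format are illustrative, not taken from the cited paper.

def format_imu_prompt(readings, activity_labels):
    """Turn a window of (ax, ay, az) accelerometer samples into a prompt.

    readings: list of (ax, ay, az) tuples, in units of g.
    activity_labels: candidate activity names to offer the model.
    """
    # One line per timestep, with signed fixed-width values for readability.
    lines = [
        f"t={i}: ax={ax:+.2f}, ay={ay:+.2f}, az={az:+.2f}"
        for i, (ax, ay, az) in enumerate(readings)
    ]
    labels = ", ".join(activity_labels)
    return (
        "The following are tri-axial accelerometer readings (in g) "
        "sampled from a wrist-worn IMU:\n"
        + "\n".join(lines)
        + f"\nWhich activity best matches this window? Options: {labels}."
    )


# Example usage with a short, made-up sample window.
window = [(0.01, -0.98, 0.12), (0.85, -0.40, 0.30), (0.90, -0.35, 0.28)]
prompt = format_imu_prompt(window, ["walking", "typing", "stirring a pot"])
print(prompt)
```

The resulting string would then be passed to a fine-tuned LLM; the key design choice such pipelines face is how much raw signal to expose as text versus summarizing it into features first.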

Sources

Exploring the Capabilities of LLMs for IMU-based Fine-grained Human Activity Understanding

NuScenes-SpatialQA: A Spatial Understanding and Reasoning Benchmark for Vision-Language Models in Autonomous Driving

OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning

The Point, the Vision and the Text: Does Point Cloud Boost Spatial Reasoning of Large Language Models?

OCC-MLLM-CoT-Alpha: Towards Multi-stage Occlusion Recognition Based on Large Language Models via 3D-Aware Supervision and Chain-of-Thoughts Guidance

Grounding 3D Object Affordance with Language Instructions, Visual Observations and Interactions

PanoDreamer: Consistent Text to 360-Degree Scene Generation

Imperative vs. Declarative Programming Paradigms for Open-Universe Scene Generation

How to Enable LLM with 3D Capacity? A Survey of Spatial Reasoning in LLM

IAAO: Interactive Affordance Learning for Articulated Objects in 3D Environments
