Advances in 3D Spatial Understanding with Large Language Models

The field of 3D spatial understanding is advancing rapidly through the integration of large language models (LLMs). Recent work shows that LLMs can be fine-tuned to reach state-of-the-art results on a range of 3D understanding tasks, including human activity recognition, spatial reasoning, and object affordance grounding. Notably, researchers have proposed methods that leverage point clouds, vision-language models, and counterfactual reasoning to strengthen 3D spatial understanding. These advances carry significant implications for real-world applications such as autonomous driving, robotics, and virtual reality.

Some noteworthy papers in this area include:

  • Exploring the Capabilities of LLMs for IMU-based Fine-grained Human Activity Understanding, which achieved a 129x improvement in fine-grained human activity recognition using LLMs.
  • NuScenes-SpatialQA, a benchmark for evaluating the spatial understanding and reasoning capabilities of vision-language models in autonomous driving.
  • OmniDrive, a holistic vision-language dataset for autonomous driving with counterfactual reasoning, which demonstrated significant improvements in decision-making and planning.
  • The Point, the Vision and the Text, which comprehensively evaluated the role of point clouds in 3D spatial reasoning and found that LLMs without point input could achieve competitive performance.
  • IAAO, a novel framework for interactive affordance learning for articulated objects in 3D environments, which enabled robust affordance-based interaction and manipulation of objects.
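To make the first item above concrete, here is a minimal sketch of how raw IMU readings might be serialized into a text prompt for an LLM-based activity classifier. This is an illustrative assumption, not the pipeline from the paper: the function name `format_imu_prompt`, the sample window, and the label set are all hypothetical.

```python
# Hypothetical sketch: serializing a window of tri-axial accelerometer
# samples into a natural-language prompt for an LLM activity classifier.
# Names and format are illustrative, not taken from the cited paper.

def format_imu_prompt(readings, activity_labels):
    """Turn a window of (ax, ay, az) accelerometer samples into a prompt.

    readings: list of (ax, ay, az) tuples, in units of g.
    activity_labels: candidate activity names to offer the model.
    """
    # One line per timestep, with signed fixed-width values for readability.
    lines = [
        f"t={i}: ax={ax:+.2f}, ay={ay:+.2f}, az={az:+.2f}"
        for i, (ax, ay, az) in enumerate(readings)
    ]
    labels = ", ".join(activity_labels)
    return (
        "The following are tri-axial accelerometer readings (in g) "
        "sampled from a wrist-worn IMU:\n"
        + "\n".join(lines)
        + f"\nWhich activity best matches this window? Options: {labels}."
    )


# Example usage with a short, made-up sample window.
window = [(0.01, -0.98, 0.12), (0.85, -0.40, 0.30), (0.90, -0.35, 0.28)]
prompt = format_imu_prompt(window, ["walking", "typing", "stirring a pot"])
print(prompt)
```

The resulting string would then be passed to a fine-tuned LLM; the key design choice such pipelines face is how much raw signal to expose as text versus summarizing it into features first.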

Sources

Exploring the Capabilities of LLMs for IMU-based Fine-grained Human Activity Understanding

NuScenes-SpatialQA: A Spatial Understanding and Reasoning Benchmark for Vision-Language Models in Autonomous Driving

OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning

The Point, the Vision and the Text: Does Point Cloud Boost Spatial Reasoning of Large Language Models?

OCC-MLLM-CoT-Alpha: Towards Multi-stage Occlusion Recognition Based on Large Language Models via 3D-Aware Supervision and Chain-of-Thoughts Guidance

Grounding 3D Object Affordance with Language Instructions, Visual Observations and Interactions

PanoDreamer: Consistent Text to 360-Degree Scene Generation

Imperative vs. Declarative Programming Paradigms for Open-Universe Scene Generation

How to Enable LLM with 3D Capacity? A Survey of Spatial Reasoning in LLM

IAAO: Interactive Affordance Learning for Articulated Objects in 3D Environments
