The field of 3D spatial understanding is advancing rapidly through the integration of large language models (LLMs). Recent work shows that fine-tuned LLMs can reach state-of-the-art results on tasks such as human activity recognition, spatial reasoning, and object affordance grounding, with methods that leverage point clouds, vision-language models, and counterfactual reasoning. These advances have direct implications for real-world applications, including autonomous driving, robotics, and virtual reality.
Some noteworthy papers in this area include:
- Exploring the Capabilities of LLMs for IMU-based Fine-grained Human Activity Understanding, which achieved a 129x improvement in fine-grained human activity recognition by applying LLMs to IMU sensor data (a minimal prompting sketch follows this list).
- NuScenes-SpatialQA, a benchmark for evaluating the spatial understanding and reasoning capabilities of vision-language models in autonomous driving.
- OmniDrive, a holistic vision-language dataset for autonomous driving with counterfactual reasoning, which demonstrated significant improvements in decision-making and planning.
- The Point, the Vision and the Text, which comprehensively evaluated the role of point clouds in 3D spatial reasoning and found that LLMs without point input could achieve competitive performance.
- IAAO, a novel framework for interactive affordance learning for articulated objects in 3D environments, which enabled robust affordance-based interaction and manipulation of objects.
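To make the IMU-plus-LLM idea concrete, here is a minimal, illustrative sketch of serializing an IMU window into a text prompt and asking a general-purpose LLM for an activity label. This is not the pipeline from the cited paper: the prompt wording, the example label set, and the `query_llm` hook are all assumptions made for illustration.

```python
# Illustrative sketch only: turn raw IMU readings into a text prompt that any
# text-in/text-out LLM can answer. The label set, prompt format, and query_llm
# callable are assumptions, not the cited paper's method.
from typing import Callable, Sequence, Tuple

ACTIVITY_LABELS = ["walking", "climbing stairs", "typing", "drinking water"]  # assumed example labels

def build_activity_prompt(accel: Sequence[Tuple[float, float, float]],
                          gyro: Sequence[Tuple[float, float, float]],
                          hz: int = 50) -> str:
    """Serialize accelerometer/gyroscope windows into a plain-text prompt."""
    accel_txt = "; ".join(f"({x:.2f},{y:.2f},{z:.2f})" for x, y, z in accel)
    gyro_txt = "; ".join(f"({x:.2f},{y:.2f},{z:.2f})" for x, y, z in gyro)
    return (
        f"The following IMU window was sampled at {hz} Hz.\n"
        f"Accelerometer (m/s^2): {accel_txt}\n"
        f"Gyroscope (rad/s): {gyro_txt}\n"
        f"Which one of these activities best matches the signal: {', '.join(ACTIVITY_LABELS)}?\n"
        "Answer with a single label."
    )

def classify_window(accel, gyro, query_llm: Callable[[str], str]) -> str:
    """query_llm is any caller-supplied function that sends a prompt to an LLM and returns its reply."""
    answer = query_llm(build_activity_prompt(accel, gyro)).strip().lower()
    # Fall back to the first label if the model's reply does not match the label set.
    return next((lbl for lbl in ACTIVITY_LABELS if lbl in answer), ACTIVITY_LABELS[0])
```

A benchmark such as NuScenes-SpatialQA plausibly follows the same query-and-score pattern for vision-language models, with driving-scene context taking the place of the sensor readings.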