Advances in Open-Vocabulary Object Detection and 3D Scene Understanding

Open-vocabulary object detection and 3D scene understanding are advancing rapidly. Recent methods detect previously unseen object categories from natural language descriptions, which makes aerial scene understanding systems more capable and autonomous. Other work aims to improve the spatial reasoning of vision-language models with techniques such as chain-of-thought prompting and reinforcement learning. Combining 2D foundation models with 3D perception also shows promise, enabling scalable open-vocabulary 3D object detection without human-annotated labels. Noteworthy papers include:

  • Open-Vocabulary Object Detection in UAV Imagery, which presents a comprehensive survey of open-vocabulary object detection methods for aerial imagery.
  • Just Add Geometry, which achieves competitive localization performance in open-vocabulary 3D object detection by combining a 2D vision-language detector with a geometric inflation strategy.
  • Descrip3D, which enhances large language model-based 3D scene understanding with object-level text descriptions, demonstrating improved performance on various benchmark datasets.
  • DiSCO-3D, which addresses 3D open-vocabulary sub-concept discovery, performing effectively overall and reaching state-of-the-art results in edge cases.
  • Spatial 3D-LLM, which proposes a 3D multimodal large language model with enhanced spatial awareness and achieves state-of-the-art performance across various 3D vision-language tasks.
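
The general idea of lifting a 2D detection into 3D, as mentioned for Just Add Geometry above, can be illustrated with a minimal sketch. This is not the paper's method: it assumes a pinhole camera with known intrinsics (fx, fy, cx, cy), a per-pixel depth map in meters, and a 2D box that some detector has already produced. The function names backproject_box and axis_aligned_box are hypothetical and chosen only for this illustration.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def backproject_box(
    box: Tuple[int, int, int, int],
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point3D]:
    """Back-project every pixel inside a 2D box into camera-frame 3D points.

    Assumption: `box` is (u_min, v_min, u_max, v_max) in pixel coordinates,
    inclusive, and `depth[v][u]` is the depth along the optical axis.
    Pixels with non-positive depth are treated as missing and skipped.
    """
    u_min, v_min, u_max, v_max = box
    points: List[Point3D] = []
    for v in range(v_min, v_max + 1):
        for u in range(u_min, u_max + 1):
            z = depth[v][u]
            if z <= 0:
                continue
            # Standard pinhole back-projection.
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


def axis_aligned_box(points: Sequence[Point3D]) -> Tuple[Point3D, Point3D]:
    """Return (min_corner, max_corner) of the tightest axis-aligned 3D box."""
    if not points:
        raise ValueError("no valid depth inside the 2D box")
    xs, ys, zs = zip(*points)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))


if __name__ == "__main__":
    # Toy 3x3 depth map; the top-left pixel has no valid depth.
    depth_map = [
        [0.0, 2.0, 2.0],
        [2.0, 2.0, 2.0],
        [2.0, 2.0, 2.0],
    ]
    pts = backproject_box((0, 0, 2, 2), depth_map, fx=1.0, fy=1.0, cx=1.0, cy=1.0)
    print(axis_aligned_box(pts))
```

A practical pipeline would also need to reject background pixels inside the 2D box and estimate orientation; this sketch leaves both out to keep the geometry visible.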

Sources

Open-Vocabulary Object Detection in UAV Imagery: A Review and Future Perspectives

Enhancing Spatial Reasoning in Vision-Language Models via Chain-of-Thought Prompting and Reinforcement Learning

Just Add Geometry: Gradient-Free Open-Vocabulary 3D Detection Without Human-in-the-Loop

Descrip3D: Enhancing Large Language Model-based 3D Scene Understanding with Object-Level Text Descriptions

DiSCO-3D: Discovering and segmenting Sub-Concepts from Open-vocabulary queries in NeRF

Enhancing Visual Planning with Auxiliary Tasks and Multi-token Prediction

An aerial color image anomaly dataset for search missions in complex forested terrain

Spatial 3D-LLM: Exploring Spatial Awareness in 3D Vision-Language Models

Semi-off-Policy Reinforcement Learning for Vision-Language Slow-thinking Reasoning

ReMeREC: Relation-aware and Multi-entity Referring Expression Comprehension

From Scan to Action: Leveraging Realistic Scans for Embodied Scene Understanding
