Computer vision research is advancing quickly in open-vocabulary object detection and 3D scene understanding. Recent methods detect previously unseen object categories from natural language descriptions, which makes aerial scene understanding systems more capable and autonomous. Other work improves the spatial reasoning of vision-language models with techniques such as chain-of-thought prompting and reinforcement learning. Integrating 2D foundation models with 3D perception also shows promise, enabling scalable, open-vocabulary 3D object detection without human-annotated labels. Noteworthy papers include:
- Open-Vocabulary Object Detection in UAV Imagery, which presents a comprehensive survey of open-vocabulary object detection methods for aerial imagery.
- Just Add Geometry, which achieves competitive localization performance in open-vocabulary 3D object detection using a 2D vision-language detector and a geometric inflation strategy.
- Descrip3D, which enhances large language model-based 3D scene understanding with object-level text descriptions, demonstrating improved performance on various benchmark datasets.
- DiSCO-3D, which addresses 3D open-vocabulary sub-concept discovery, reporting effective performance and state-of-the-art results in edge cases.
- Spatial 3D-LLM, which proposes a 3D multimodal large language model with enhanced spatial awareness, reporting state-of-the-art performance across a range of 3D vision-language tasks.
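
To make the idea of lifting 2D detections into 3D concrete, here is a minimal sketch of one generic way to do it: back-project the depth pixels inside a 2D box through a pinhole camera model and take the axis-aligned bounds of the resulting points. This is an illustration under stated assumptions, not the procedure from Just Add Geometry or any other paper listed above. The names `Intrinsics`, `backproject`, and `inflate_box` are hypothetical, and the 2D box is assumed to come from some open-vocabulary detector.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float, float]


@dataclass
class Intrinsics:
    """Pinhole camera intrinsics (hypothetical container for this sketch)."""
    fx: float  # focal length in pixels, x axis
    fy: float  # focal length in pixels, y axis
    cx: float  # principal point, x
    cy: float  # principal point, y


def backproject(u: float, v: float, z: float, k: Intrinsics) -> Point:
    """Lift pixel (u, v) with depth z into camera-frame coordinates."""
    return ((u - k.cx) * z / k.fx, (v - k.cy) * z / k.fy, z)


def inflate_box(
    box_2d: Tuple[int, int, int, int],
    depth: Sequence[Sequence[float]],
    k: Intrinsics,
) -> Optional[Tuple[Point, Point]]:
    """Turn a 2D box into an axis-aligned 3D box in the camera frame.

    box_2d is (u_min, v_min, u_max, v_max) in pixels, max bounds exclusive.
    depth is a row-major depth map; non-positive values mean "no reading".
    Returns (min_corner, max_corner), or None if the box has no valid depth.
    """
    u0, v0, u1, v1 = box_2d
    points = [
        backproject(u, v, depth[v][u], k)
        for v in range(v0, v1)
        for u in range(u0, u1)
        if depth[v][u] > 0  # skip missing depth readings
    ]
    if not points:
        return None
    xs, ys, zs = zip(*points)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))


# Toy example: a 4x4 depth map at 2 m with one missing reading.
toy_depth = [[2.0] * 4 for _ in range(4)]
toy_depth[1][1] = 0.0
toy_k = Intrinsics(fx=100.0, fy=100.0, cx=2.0, cy=2.0)
toy_box_3d = inflate_box((1, 1, 3, 3), toy_depth, toy_k)
```

A practical pipeline would likely need more than this, for example rejecting background depth that falls inside the 2D box and estimating box orientation, so the sketch should be read only as the geometric core of the 2D-to-3D step.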