The field of remote sensing is seeing rapid progress in object detection and vision-language modeling. Researchers are fusing optical and SAR imagery to improve detection accuracy, particularly in complex environments where the two sensors capture complementary information. Large-scale, standardized datasets and benchmarking toolkits are making it easier to evaluate and compare methods. Vision-language models are also being applied to remote sensing tasks such as image-text retrieval and visual question answering, with a focus on learning image-language alignments from large datasets. Multi-modal and multi-resolution approaches are gaining traction as well, since they allow complementary information to be extracted from different image modalities. Notable papers in this area include:
- M4-SAR, which introduces a comprehensive dataset for optical-SAR fusion object detection and proposes a novel end-to-end multi-source fusion detection framework.
- Vision-Language Modeling Meets Remote Sensing, which provides a comprehensive review of vision-language modeling in remote sensing and discusses future research directions.
- Visual Question Answering on Multiple Remote Sensing Image Modalities, which proposes a new VQA dataset and model for effectively combining multiple image modalities and text.
- InstructSAM, which introduces a training-free framework for instruction-driven object recognition in remote sensing imagery.
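To make the fusion idea above concrete, here is a minimal, illustrative sketch of feature-level optical-SAR fusion. This is not the architecture of M4-SAR or any specific paper: the encoders are stand-in random projections, and all dimensions and names are hypothetical. The point is simply that each modality is encoded separately and the features are concatenated so a downstream detection head can use both.

```python
import numpy as np

# Illustrative sketch only -- not the actual M4-SAR architecture.
# Each "encoder" is a fixed random projection standing in for a
# learned backbone; dimensions are arbitrary choices.

rng = np.random.default_rng(0)

def encode(x, w):
    """Project an input vector into a shared feature space."""
    return np.tanh(x @ w)

# Hypothetical dimensions: 32-dim inputs, 16-dim features per modality.
w_optical = rng.normal(size=(32, 16))
w_sar = rng.normal(size=(32, 16))

def fuse(optical, sar):
    """Encode each modality separately, then concatenate the features
    so a downstream head can exploit complementary information."""
    f_opt = encode(optical, w_optical)
    f_sar = encode(sar, w_sar)
    return np.concatenate([f_opt, f_sar], axis=-1)

fused = fuse(rng.normal(size=32), rng.normal(size=32))
print(fused.shape)  # (32,): 16 optical + 16 SAR feature dimensions
```

In practice, published fusion detectors use learned CNN or transformer backbones and often fuse with attention rather than plain concatenation, but the separate-encoders-plus-fusion structure is the common pattern.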