Multimodal Earth Observation Systems

Multimodal Earth observation is moving toward more scalable and inclusive systems, with an emphasis on training-free and multilingual approaches. Recent work shows that retrieval-augmented prompting and generative editing in a joint vision-language space can substantially improve image captioning and retrieval. Large language models and vision-language models are also being applied to zero-shot object counting and image clustering: combining hierarchical semantic alignment with optimal transport improves clustering quality, while editing fused multimodal features strengthens zero-shot composed image retrieval.
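As a rough illustration of the training-free, retrieval-augmented captioning idea, the sketch below embeds a query image with CLIP, retrieves the most similar captions from a small datastore, and assembles them into a prompt for any instruction-tuned multilingual LLM. The datastore contents, prompt template, and file names are illustrative assumptions, not details taken from the cited papers.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical caption datastore; in practice this would be a large
# external corpus of remote sensing captions.
datastore = [
    "an aerial view of a harbor with docked ships",
    "a dense residential area seen from above",
    "farmland divided into rectangular plots",
    "an airport runway crossing a grassy field",
]

@torch.no_grad()
def retrieve_captions(image: Image.Image, k: int = 2) -> list[str]:
    """Return the k datastore captions closest to the image in CLIP space."""
    img_emb = model.get_image_features(**processor(images=image, return_tensors="pt"))
    txt_emb = model.get_text_features(
        **processor(text=datastore, return_tensors="pt", padding=True)
    )
    # Cosine similarity via L2-normalized embeddings.
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    sims = (img_emb @ txt_emb.T).squeeze(0)
    return [datastore[i] for i in sims.topk(k).indices.tolist()]

def build_prompt(retrieved: list[str], language: str = "English") -> str:
    """Assemble retrieved captions into a prompt for a downstream LLM."""
    context = "\n".join(f"- {c}" for c in retrieved)
    return (
        f"Similar remote sensing images were described as:\n{context}\n"
        f"Write one concise caption in {language} for the new image."
    )

image = Image.open("scene.png")  # assumed input image
print(build_prompt(retrieve_captions(image), language="Spanish"))
```

Because the LLM only sees retrieved text, the target caption language can be switched by changing the prompt, which is what makes this style of pipeline naturally multilingual without retraining.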

Noteworthy papers include: Multilingual Training-Free Remote Sensing Image Captioning, which achieves results competitive with fully supervised English-only systems and generalizes to other languages; Generative Editing in the Joint Vision-Language Space for Zero-Shot Composed Image Retrieval, whose framework clearly outperforms prior zero-shot approaches; and Object Counting with GPT-4o and GPT-5, which shows that large multimodal models can count objects zero-shot from textual prompts alone (a minimal sketch of this setup follows).
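To make the text-only counting setup concrete, here is a minimal sketch of asking a multimodal chat model to count a named object category in an image. The model name, prompt wording, and integer-parsing step are assumptions for illustration, not the cited study's exact protocol.

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def count_objects(image_path: str, category: str) -> int:
    """Ask a multimodal LLM to count objects of one category, via text prompt only."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: any multimodal chat model would work here
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Count the {category} in this image. "
                         "Answer with a single integer only."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    # Constraining the answer to a bare integer keeps parsing trivial.
    return int(response.choices[0].message.content.strip())

print(count_objects("harbor.png", "ships"))  # hypothetical input image
```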

Sources

Multilingual Training-Free Remote Sensing Image Captioning

Hierarchical Semantic Alignment for Image Clustering

Generative Editing in the Joint Vision-Language Space for Zero-Shot Composed Image Retrieval

Object Counting with GPT-4o and GPT-5: A Comparative Study

Text-Only Training for Image Captioning with Retrieval Augmentation and Modality Gap Correction
