Multimodal Remote Sensing Advances

The field of remote sensing is moving toward the integration of multiple input modalities to improve data efficiency and out-of-distribution generalization. Recent research shows that combining optical imagery with other geographic data layers can significantly improve machine learning model performance, particularly in settings with limited labeled data. Unified foundation models that route multiple modalities through a single transformer backbone have also demonstrated strong generalization across downstream tasks, and novel architectures that address modality misalignment and redundancy have achieved state-of-the-art results in semantic segmentation. A minimal illustrative sketch of the single-backbone idea follows the list below. Noteworthy papers include:

  • SkySense V2, which presents a unified foundation model that outperforms its predecessor by an average of 1.8 points across 16 datasets.
  • AMMNet, which introduces an asymmetric architecture that achieves robust and efficient semantic segmentation through modality-specific design.
  • Met$^2$Net, which proposes a decoupled two-stage spatio-temporal forecasting model that captures inter-variable interactions and achieves state-of-the-art performance in weather prediction tasks.
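The following sketch illustrates the general single-backbone idea referenced above: modality-specific patch embeddings project optical imagery and an auxiliary geographic layer into a shared token space, and one transformer encoder processes both. All names, layer sizes, and the auxiliary-modality choice are illustrative assumptions; this is not the architecture of SkySense V2, AMMNet, or any other paper cited here.

```python
# Hypothetical sketch of a single shared transformer backbone over two modalities
# (optical patches + one co-registered auxiliary layer, e.g. elevation).
# Sizes and module names are illustrative, not taken from any cited paper.
import torch
import torch.nn as nn


class MultiModalBackbone(nn.Module):
    def __init__(self, embed_dim=256, depth=4, num_heads=8, num_classes=10,
                 optical_channels=3, aux_channels=1, patch_size=16):
        super().__init__()
        # Modality-specific patch embeddings map each input into the shared token space.
        self.optical_embed = nn.Conv2d(optical_channels, embed_dim,
                                       kernel_size=patch_size, stride=patch_size)
        self.aux_embed = nn.Conv2d(aux_channels, embed_dim,
                                   kernel_size=patch_size, stride=patch_size)
        # Learnable modality embeddings let the shared backbone distinguish token sources.
        self.modality_embed = nn.Parameter(torch.zeros(2, 1, embed_dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, num_classes)

    def _tokens(self, x, embed, modality_idx):
        tokens = embed(x).flatten(2).transpose(1, 2)       # (B, N, embed_dim)
        return tokens + self.modality_embed[modality_idx]  # tag tokens with their modality

    def forward(self, optical, aux):
        tokens = torch.cat([
            self._tokens(optical, self.optical_embed, 0),
            self._tokens(aux, self.aux_embed, 1),
        ], dim=1)
        fused = self.backbone(tokens)        # one backbone attends across both modalities
        return self.head(fused.mean(dim=1))  # mean-pool tokens, then classify


# Usage: a 128x128 optical scene plus a co-registered single-band auxiliary layer.
model = MultiModalBackbone()
optical = torch.randn(2, 3, 128, 128)
aux = torch.randn(2, 1, 128, 128)
logits = model(optical, aux)  # shape: (2, 10)
```

Keeping per-modality embeddings thin and sharing all attention layers is one simple way to realize the "single backbone, multiple modalities" setup described above; the cited works use their own, more elaborate designs.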

Sources

Using Multiple Input Modalities Can Improve Data-Efficiency and O.O.D. Generalization for ML with Satellite Imagery

SkySense V2: A Unified Foundation Model for Multi-modal Remote Sensing

AMMNet: An Asymmetric Multi-Modal Network for Remote Sensing Semantic Segmentation

Met$^2$Net: A Decoupled Two-Stage Spatio-Temporal Forecasting Model for Complex Meteorological Systems
