Recent work in remote sensing centers on multimodal analysis and bi-temporal change understanding. Researchers are integrating image and text modalities to improve accuracy and robustness in change detection and change captioning, and large language models together with multimodal fusion techniques are increasingly used to produce more accurate and interpretable results. Noteworthy papers in this area include RSCC, which introduces a large-scale dataset for disaster events; MMChange, which proposes a multimodal feature fusion network for remote sensing change detection; and BTCChat, a multi-temporal large language model notable for its advanced bi-temporal change understanding. In addition, a two-stage context learning approach with large language models has shown promising results for multimodal stance detection on climate change.
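To make the change-detection task concrete, a minimal sketch is shown below: the simplest classical baseline compares two co-registered images of the same scene pixel by pixel and thresholds the absolute difference to obtain a binary change map. This is only an illustration of the task itself, not the learned multimodal fusion methods of MMChange or BTCChat; the function name, inputs, and threshold are hypothetical.

```python
# Illustrative baseline only (not any cited paper's method): bi-temporal
# change detection by pixel-wise differencing of co-registered images.

def change_map(img_t1, img_t2, threshold=0.2):
    """Return a binary change map for two co-registered single-band images.

    img_t1, img_t2: 2-D lists of floats in [0, 1] with identical shapes,
                    acquired at times t1 and t2.
    threshold: absolute-difference level above which a pixel is "changed".
    """
    return [
        [1 if abs(a - b) > threshold else 0 for a, b in zip(row1, row2)]
        for row1, row2 in zip(img_t1, img_t2)
    ]

before = [[0.1, 0.1], [0.9, 0.1]]
after  = [[0.1, 0.8], [0.2, 0.1]]
print(change_map(before, after))  # → [[0, 1], [1, 0]]
```

Learned approaches replace the raw intensity difference with deep feature comparisons and, in the multimodal systems surveyed here, condition the comparison on text as well, which is what makes the results both more accurate and more interpretable.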