Image captioning research is moving toward more accurate and descriptive captions, with a focus on self-correction and attention-guided approaches. Researchers are also exploring retrieval-based object and relation prompts to improve captioning performance. In outdoor monitoring, there is growing interest in uncertainty-aware multimodal fusion frameworks for detecting early signs of abnormal health status, and in improving visual geo-localization for drones under varying weather conditions. Notable papers include SC-Captioner, which proposes a reinforcement learning framework for self-correcting image caption models, and WeatherPrompt, which introduces a multi-modality learning paradigm for weather-invariant representations. RORPCap and AGIC also report promising results in image captioning, while DeepLight and DUAL-Health show potential in lightning prediction and outdoor health monitoring, respectively.