Advances in Multimodal Image Understanding

The field of multimodal image understanding is evolving rapidly, with a focus on bridging the gap between visual and textual modalities. Recent studies introduce novel frameworks for image captioning, image retrieval, and image quality assessment, with potential impact on applications such as news reporting, digital archiving, and tourism. Notably, event-aware and semantic-aware techniques have improved the accuracy and relevance of image captions, while multimodal large language models and visual prompts have shown promising results in no-reference image quality assessment and aesthetic image captioning. Overall, the field is moving toward more sophisticated, context-aware image understanding systems. Noteworthy papers include EVENT-Retriever, which ranked first on the private test set of Track 2 of the EVENTA 2025 Grand Challenge, and Aesthetic Image Captioning with Saliency Enhanced MLLMs, which reports state-of-the-art performance on mainstream aesthetic image captioning (AIC) benchmarks.
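
For readers unfamiliar with how visual and textual modalities are bridged in metrics like the CLIP-score variant proposed in SPECS, the sketch below shows the common baseline: cosine similarity between CLIP image and text embeddings. This is a minimal illustration using an off-the-shelf CLIP checkpoint, not the SPECS method itself; the model name and the `clip_score` helper are illustrative assumptions, not taken from the cited papers.

```python
# Minimal sketch of a CLIP-based image-caption similarity score, the common
# building block behind CLIP-Score-style caption evaluation metrics.
# NOTE: illustrative only; this is NOT the SPECS implementation.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, caption: str) -> float:
    """Cosine similarity between the CLIP embeddings of an image and a caption."""
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Normalize the projected embeddings, then take their dot product.
    img = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    txt = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum(dim=-1))

# Usage (hypothetical file name):
# score = clip_score(Image.open("photo.jpg"), "a crowd gathers at a news event")
```

The same score, computed against a gallery of image embeddings instead of a single pair, is also the standard retrieval baseline that event-aware systems such as EVENT-Retriever build upon.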

Sources

Automatic Identification and Description of Jewelry Through Computer Vision and Neural Networks for Translators and Interpreters

EVENT-Retriever: Event-Aware Multimodal Image Retrieval for Realistic Captions

ReCap: Event-Aware Image Captioning with Article Retrieval and Semantic Gaussian Normalization

Parameter-Efficient Adaptation of mPLUG-Owl2 via Pixel-Level Visual Prompts for NR-IQA

A Multidimensional AI-powered Framework for Analyzing Tourist Perception in Historic Urban Quarters: A Case Study in Shanghai

SPECS: Specificity-Enhanced CLIP-Score for Long Image Caption Evaluation

Aesthetic Image Captioning with Saliency Enhanced MLLMs

Multimodal Reasoning for Science: Technical Report and 1st Place Solution to the ICML 2025 SeePhys Challenge
