Research in vision-language understanding is increasingly focused on a key limitation of current models: accurately perceiving and extracting fine-grained structure from visual data. Recent work develops new frameworks and benchmarks to evaluate and improve vision-language models (VLMs) on tasks such as data visualization understanding, image captioning, and chart grounding.
Noteworthy papers include: Diagnosing Bottlenecks in Data Visualization Understanding by Vision-Language Models, which investigates the sources of VLM errors and proposes a suite of tasks to characterize potential difficulties; VisJudge-Bench, a comprehensive benchmark for evaluating how well MLLMs assess visualization aesthetics and quality, which reveals significant gaps between current models and human experts; and Top-Down Semantic Refinement for Image Captioning, which frames image captioning as a goal-oriented hierarchical refinement planning problem and achieves state-of-the-art results on multiple benchmarks. Other notable works include DualCap, a lightweight image captioning approach that uses dual retrieval with visual prompts from similar scenes, and DiagramEval, a novel evaluation metric for assessing the quality of LLM-generated diagrams.
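To make the kind of fine-grained evaluation these benchmarks formalize more concrete, the sketch below scores a model's extraction of individual data points from a chart against ground truth under a relative tolerance. The scoring rule, tolerance, and example values are illustrative assumptions, not taken from any of the papers above.

```python
from dataclasses import dataclass

@dataclass
class ChartPoint:
    label: str      # e.g. an x-axis category
    value: float    # ground-truth y value

def value_accuracy(predicted: dict[str, float],
                   ground_truth: list[ChartPoint],
                   rel_tol: float = 0.05) -> float:
    """Fraction of data points whose extracted value falls within
    rel_tol of the ground truth -- one plausible scoring rule for
    fine-grained chart-grounding evaluation (an assumption here)."""
    correct = 0
    for pt in ground_truth:
        pred = predicted.get(pt.label)
        if pred is not None and abs(pred - pt.value) <= rel_tol * abs(pt.value):
            correct += 1
    return correct / len(ground_truth) if ground_truth else 0.0

# Hypothetical example: score a VLM's extraction of a three-bar chart.
truth = [ChartPoint("2021", 14.0), ChartPoint("2022", 18.5), ChartPoint("2023", 21.0)]
vlm_output = {"2021": 14.2, "2022": 17.0, "2023": 21.1}  # values read off by the model
print(f"value accuracy: {value_accuracy(vlm_output, truth):.2f}")  # -> 0.67
```

Benchmarks of this kind typically aggregate such per-chart scores across many visualization types to expose where perception of fine-grained structure breaks down.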