Advances in Vision-Language Understanding

Research in vision-language understanding is increasingly focused on the limitations of current models in perceiving and extracting fine-grained structure from visual data. Researchers are developing new frameworks and benchmarks to evaluate and improve vision-language models (VLMs) on tasks such as data visualization understanding, image captioning, and chart grounding.

Noteworthy papers include: Diagnosing Bottlenecks in Data Visualization Understanding by Vision-Language Models, which investigates the sources of VLM errors and proposes a suite of tasks to characterize their difficulties. VisJudge-Bench, a comprehensive benchmark for evaluating how well multimodal LLMs assess visualization aesthetics and quality, reveals significant gaps between current models and human experts. Top-Down Semantic Refinement for Image Captioning frames image captioning as a goal-oriented hierarchical refinement planning problem and achieves state-of-the-art results on multiple benchmarks. Other notable works include DualCap, a lightweight image captioning approach that uses dual retrieval with similar-scene visual prompts, and DiagramEval, a graph-based metric for evaluating the quality of LLM-generated diagrams.

Sources

Diagnosing Bottlenecks in Data Visualization Understanding by Vision-Language Models

VisJudge-Bench: Aesthetics and Quality Assessment of Visualizations

Top-Down Semantic Refinement for Image Captioning

DualCap: Enhancing Lightweight Image Captioning via Dual Retrieval with Similar Scenes Visual Prompts

Instance-Level Composed Image Retrieval

DiagramEval: Evaluating LLM-Generated Diagrams via Graphs

GUI Knowledge Bench: Revealing the Knowledge Gap Behind VLM Failures in GUI Tasks

ChartAB: A Benchmark for Chart Grounding & Dense Alignment

Masked Diffusion Captioning for Visual Feature Learning
