Robustness and Reliability in Visual Language Models

The field of visual language models is placing growing emphasis on robustness and reliability. Recent studies have shown that current models are fragile under minor perturbations, such as small changes to an image or harmless rephrasings of a question, which has prompted calls for evaluations that go beyond traditional accuracy metrics. Researchers are now developing benchmarks and evaluation frameworks that probe robustness to several kinds of perturbation, including misleading visual inputs and unanswerable questions. Notable papers in this area include:

Questioning the Stability of Visual Question Answering, which demonstrates how sensitive modern visual language models are to minor perturbations and proposes a new evaluation framework.

MVI-Bench, which introduces a comprehensive benchmark for evaluating the robustness of large vision-language models to misleading visual inputs.

Q-Doc, which proposes a three-tiered framework for assessing the document image quality assessment capabilities of multi-modal large language models.
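
To make the perturbation-robustness idea concrete, below is a minimal sketch of a paraphrase-consistency check for a VQA system. The `answer_question` stub, the `consistency_rate` metric, and the example questions are illustrative assumptions rather than the protocol of any paper cited here; a real evaluation would swap in an actual model call and a benchmark's own paraphrase sets.

```python
from collections import Counter

def answer_question(image_path: str, question: str) -> str:
    # Stand-in for a real visual language model call (local model or API);
    # replace this stub with actual inference.
    return "red"

def consistency_rate(image_path: str, paraphrases: list[str]) -> float:
    """Fraction of paraphrased questions whose normalized answer
    agrees with the majority answer across all paraphrases."""
    answers = [answer_question(image_path, q).strip().lower()
               for q in paraphrases]
    _, majority_count = Counter(answers).most_common(1)[0]
    return majority_count / len(answers)

if __name__ == "__main__":
    paraphrases = [
        "What color is the car?",
        "Which color does the car have?",
        "Tell me the car's color.",
    ]
    # A perturbation-stable model answers every phrasing identically,
    # yielding a consistency rate of 1.0; fragile models score lower.
    print(consistency_rate("street.jpg", paraphrases))
```

The same harness extends to other perturbation types from the papers above, for example by replacing the paraphrase list with visually perturbed copies of the image, or by checking that the model abstains on unanswerable questions instead of guessing.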

Sources

Questioning the Stability of Visual Question Answering

Q-Doc: Benchmarking Document Image Quality Assessment Capabilities in Multi-modal Large Language Models

Benchmarking Visual LLMs Resilience to Unanswerable Questions on Visually Rich Documents

MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs

Evaluating Multimodal Large Language Models on Vertically Written Japanese Text
