Advances in Mitigating Hallucinations in Visual Question Answering

The field of Visual Question Answering (VQA) is increasingly focused on the challenge of hallucinations: responses that contradict the input image. Researchers are exploring approaches to detect and mitigate hallucinations, including diffusion models, instruction-aligned visual attention, and self-awareness mechanisms. These advances could improve the reliability and trustworthiness of VQA systems, particularly in high-stakes applications such as medical diagnosis. Notable papers in this area include:

  • DiN, which introduces a diffusion model to handle noisy labels in Med-VQA.
  • IAVA, which proposes an instruction-aligned visual attention approach to mitigate hallucinations in large vision-language models.
  • VASE, which incorporates weak image transformations and amplifies the impact of visual input to improve hallucination detection in medical VQA.
  • SAFEQA, which utilizes image features, salient region features, and quality features to improve the perception and comprehension abilities of the model in low-level vision tasks.
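The entropy-based detection idea behind methods like VASE can be illustrated with a minimal sketch: sample several answers to the same question, cluster semantically equivalent ones, and treat high entropy over the clusters as a hallucination signal. The string-normalization clustering below is an illustrative stand-in (real systems typically cluster with an entailment model, and VASE additionally amplifies the visual input via weak image transformations):

```python
import math
from collections import Counter

def semantic_entropy(answers):
    """Entropy over clusters of semantically equivalent answers.

    Answers are grouped here by simple normalization (lowercase,
    stripped); this is a simplifying assumption for illustration.
    """
    clusters = Counter(a.strip().lower() for a in answers)
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in clusters.values())

# Sampled answers to the same (image, question) pair:
consistent = ["Pneumonia", "pneumonia", "Pneumonia"]
inconsistent = ["Pneumonia", "Effusion", "Normal"]

# Low entropy: the samples agree, so the answer is likely grounded.
# High entropy: the samples disagree, flagging a possible hallucination.
print(semantic_entropy(consistent))    # 0.0
print(semantic_entropy(inconsistent))  # ln(3) ≈ 1.0986
```

In this sketch, an answer would be flagged when its entropy exceeds a tuned threshold; amplifying the visual input before sampling, as VASE does, aims to make that entropy more sensitive to image-grounded disagreement.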

Sources

DiN: Diffusion Model for Robust Medical VQA with Semantic Noisy Labels

Instruction-Aligned Visual Attention for Mitigating Hallucinations in Large Vision-Language Models

Vision-Amplified Semantic Entropy for Hallucination Detection in Medical Visual Question Answering

Mitigating Low-Level Visual Hallucinations Requires Self-Awareness: Database, Model and Training Strategy
