Mitigating Hallucinations in Vision-Language Models

Research on vision-language models is increasingly focused on hallucinations: outputs that are not grounded in the visual input and that undermine the accuracy and reliability of these models. Recent work mitigates hallucinations through several complementary strategies, including selective and contrastive decoding, autoregressive semantic visual reconstruction, and token-level localization of hallucinated content, with promising results across public benchmarks.

Several papers stand out. SECOND mitigates perceptual hallucination via selective and contrastive decoding, ASVR introduces autoregressive semantic visual reconstruction to strengthen image understanding in large vision-language models, and HalLoc localizes hallucinations at the token level and detects them with graded confidence. Other notable contributions include a training-free framework for mitigating semantic hallucination in scene text spotting and understanding that performs strongly on public benchmarks, and HAVIR, a hierarchical vision-to-image reconstruction method that recovers highly complex visual stimuli from brain activity. Together, these approaches mark steady progress toward more accurate and reliable vision-language models.
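To make the decoding-side ideas concrete, below is a minimal sketch of image-conditioned contrastive decoding as it is commonly applied to vision-language models: next-token logits conditioned on the real image are contrasted against logits from a degraded or absent visual input, penalizing tokens the language prior favors regardless of the image. This is a generic illustration under assumed inputs, not the exact procedure from SECOND or any other paper listed below; all function and parameter names are hypothetical.

```python
import torch
import torch.nn.functional as F

def contrastive_decode_step(
    logits_with_image: torch.Tensor,  # (vocab,) logits conditioned on the real image
    logits_degraded: torch.Tensor,    # (vocab,) logits with a degraded/absent image
    alpha: float = 1.0,               # strength of the contrastive penalty (assumed)
    beta: float = 0.1,                # adaptive plausibility cutoff (assumed)
) -> int:
    """One greedy step of image-conditioned contrastive decoding (illustrative).

    Tokens that stay likely even without the image are treated as
    language-prior guesses and down-weighted; a plausibility mask keeps
    the contrast from promoting tokens the full model finds implausible.
    """
    # Contrast: amplify evidence that actually depends on the real image.
    contrastive = (1 + alpha) * logits_with_image - alpha * logits_degraded

    # Adaptive plausibility constraint: only tokens whose probability under
    # the full-image model is within a factor beta of the best token remain.
    probs = F.softmax(logits_with_image, dim=-1)
    mask = probs >= beta * probs.max()
    contrastive = contrastive.masked_fill(~mask, float("-inf"))

    # Greedy selection over the contrasted, masked distribution.
    return int(torch.argmax(contrastive).item())
```

In practice the two logit vectors would come from two forward passes of the same model (one with the original image, one with a blurred or blank image), and the step above would run inside the usual autoregressive loop.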

Sources

When Semantics Mislead Vision: Mitigating Large Multimodal Models Hallucinations in Scene Text Spotting and Understanding

HAVIR: HierArchical Vision to Image Reconstruction using CLIP-Guided Versatile Diffusion

A Comprehensive Study of Decoder-Only LLMs for Text-to-Image Generation

SECOND: Mitigating Perceptual Hallucination in Vision-Language Models via Selective and Contrastive Decoding

Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better

Revisit What You See: Disclose Language Prior in Vision Tokens for Efficient Guided Decoding of LVLMs

ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs

HalLoc: Token-level Localization of Hallucinations for Vision Language Models

Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs
