Mitigating Hallucinations in Multimodal Large Language Models

Research on multimodal large language models (MLLMs) is increasingly focused on hallucinations, where models fabricate details inconsistent with the image content. Recent work improves visual factual consistency by integrating explicit factual signals during finetuning and by leveraging pixel-level grounding, and reports reduced hallucination rates without degrading general understanding and reasoning. Noteworthy papers include Grounded Visual Factualization, which introduces factual anchor-based finetuning; VBackChecker, a reference-free hallucination detection framework that checks MLLM-generated responses against the visual input via backward grounding; and Spectral Representation Filtering, a lightweight, training-free method that suppresses hallucinations by analyzing and correcting the covariance structure of the model's representations. Complementing these methods, new benchmarks probe text-interference color hallucinations and hallucination behavior in voluntary imagination tasks.
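
The spectral-filtering idea lends itself to a compact illustration: estimate the covariance of a layer's hidden states and project out a few dominant directions. The sketch below is a minimal, assumed PCA-style variant rather than the paper's actual procedure; the function name `spectral_filter`, the `num_components` knob, and the choice of where in the model to apply the filter are all illustrative.

```python
import torch


def spectral_filter(hidden_states: torch.Tensor, num_components: int = 4) -> torch.Tensor:
    """Suppress the leading spectral directions of a layer's hidden states.

    hidden_states: (num_tokens, hidden_dim) activations from one decoder layer.
    num_components: number of leading eigendirections to remove (assumed knob).
    """
    # Center the representations and estimate their covariance structure.
    centered = hidden_states - hidden_states.mean(dim=0, keepdim=True)
    cov = centered.T @ centered / max(hidden_states.shape[0] - 1, 1)

    # Eigendecompose the symmetric covariance matrix (eigenvalues ascending).
    eigvals, eigvecs = torch.linalg.eigh(cov)
    top_dirs = eigvecs[:, -num_components:]  # directions with largest variance

    # Remove the component of each (centered) state lying in those directions.
    projection = top_dirs @ top_dirs.T       # (hidden_dim, hidden_dim)
    return hidden_states - centered @ projection


# Example: filter activations captured by a (hypothetical) forward hook.
states = torch.randn(128, 4096)              # 128 tokens, 4096-dim hidden states
filtered = spectral_filter(states, num_components=4)
```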

Sources

Grounded Visual Factualization: Factual Anchor-Based Finetuning for Enhancing MLLM Factual Consistency

Seeing is Believing: Rich-Context Hallucination Detection for MLLMs via Backward Visual Grounding

Suppressing VLM Hallucinations with Spectral Representation Filtering

What Color Is It? A Text-Interference Multimodal Hallucination Benchmark

VOPE: Revisiting Hallucination of Vision-Language Models in Voluntary Imagination Task
