Mitigating Hallucinations in Multimodal Models

Multimodal research is increasingly focused on hallucinations in large vision-language models. Researchers are exploring methods to refine textual embeddings, enforce evidential grounding, and improve faithfulness in multimodal reasoning. Notable advances include frameworks that integrate visual information to mitigate hallucinations and strengthen visual grounding.

Several papers stand out. Towards Mitigating Hallucinations in Large Vision-Language Models by Refining Textual Embeddings proposes a simple yet effective method that refines textual embeddings by integrating average-pooled visual features. FaithAct: Faithfulness Planning and Acting in MLLMs introduces a faithfulness-first planning and acting framework that enforces evidential grounding at every reasoning step. Causally-Grounded Dual-Path Attention Intervention for Object Hallucination Mitigation in LVLMs proposes a causally grounded framework that models the hallucination process via a structural causal graph. Taming Object Hallucinations with Verified Atomic Confidence Estimation introduces a simple framework that mitigates hallucinations through self-verification and confidence calibration.
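
The embedding-refinement idea in the first paper can be pictured with a minimal sketch. The function name, tensor shapes, and mixing weight `alpha` below are illustrative assumptions, not the authors' implementation: the only mechanism taken from the summary is adding an average-pooled visual feature vector into the textual token embeddings.

```python
import torch

def refine_text_embeddings(text_emb: torch.Tensor,
                           visual_feats: torch.Tensor,
                           alpha: float = 0.1) -> torch.Tensor:
    """Hypothetical sketch of textual-embedding refinement with pooled visual features.

    text_emb:     (batch, seq_len, dim) textual token embeddings
    visual_feats: (batch, num_patches, dim) visual encoder outputs
    alpha:        assumed mixing weight (not specified in the summary)
    """
    # Average-pool the visual patch features into one global vector per image.
    pooled = visual_feats.mean(dim=1, keepdim=True)   # (batch, 1, dim)
    # Blend the pooled visual signal into every textual token embedding,
    # nudging the language stream toward the visual evidence.
    return text_emb + alpha * pooled                  # (batch, seq_len, dim)
```

In this sketch the visual signal is injected additively before decoding; the actual paper may condition the embeddings differently, but the broadcast-add keeps the example self-contained and shape-correct.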

Sources

Towards Mitigating Hallucinations in Large Vision-Language Models by Refining Textual Embeddings

FaithAct: Faithfulness Planning and Acting in MLLMs

Structured Uncertainty guided Clarification for LLM Agents

"It's trained by non-disabled people": Evaluating How Image Quality Affects Product Captioning with VLMs

EVADE: LLM-Based Explanation Generation and Validation for Error Detection in NLI

Plug-and-Play Clarifier: A Zero-Shot Multimodal Framework for Egocentric Intent Disambiguation

Causally-Grounded Dual-Path Attention Intervention for Object Hallucination Mitigation in LVLMs

Taming Object Hallucinations with Verified Atomic Confidence Estimation
