Mitigating Hallucinations in Multimodal Models

Multimodal research is increasingly focused on hallucinations in large vision-language models. Researchers are exploring methods to refine textual embeddings, enforce evidential grounding, and improve faithfulness in multimodal reasoning. Notable advances include frameworks that integrate visual information to mitigate hallucinations and strengthen visual grounding.

Several papers stand out. Towards Mitigating Hallucinations in Large Vision-Language Models by Refining Textual Embeddings proposes a simple yet effective method that refines textual embeddings by integrating average-pooled visual features. FaithAct: Faithfulness Planning and Acting in MLLMs introduces a faithfulness-first planning and acting framework that enforces evidential grounding at every reasoning step. Causally-Grounded Dual-Path Attention Intervention for Object Hallucination Mitigation in LVLMs proposes a causally grounded framework that models the hallucination process via a structural causal graph. Taming Object Hallucinations with Verified Atomic Confidence Estimation introduces a simple framework that mitigates hallucinations through self-verification and confidence calibration.
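
The embedding-refinement idea in the first paper can be pictured with a minimal sketch. The function name, tensor shapes, and mixing weight `alpha` below are illustrative assumptions, not the authors' implementation: the only mechanism taken from the summary is adding an average-pooled visual feature vector into the textual token embeddings.

```python
import torch

def refine_text_embeddings(text_emb: torch.Tensor,
                           visual_feats: torch.Tensor,
                           alpha: float = 0.1) -> torch.Tensor:
    """Hypothetical sketch of textual-embedding refinement with pooled visual features.

    text_emb:     (batch, seq_len, dim) textual token embeddings
    visual_feats: (batch, num_patches, dim) visual encoder outputs
    alpha:        assumed mixing weight (not specified in the summary)
    """
    # Average-pool the visual patch features into one global vector per image.
    pooled = visual_feats.mean(dim=1, keepdim=True)   # (batch, 1, dim)
    # Blend the pooled visual signal into every textual token embedding,
    # nudging the language stream toward the visual evidence.
    return text_emb + alpha * pooled                  # (batch, seq_len, dim)
```

In this sketch the visual signal is injected additively before decoding; the actual paper may condition the embeddings differently, but the broadcast-add keeps the example self-contained and shape-correct.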

Sources

Towards Mitigating Hallucinations in Large Vision-Language Models by Refining Textual Embeddings

FaithAct: Faithfulness Planning and Acting in MLLMs

Structured Uncertainty guided Clarification for LLM Agents

"It's trained by non-disabled people": Evaluating How Image Quality Affects Product Captioning with VLMs

EVADE: LLM-Based Explanation Generation and Validation for Error Detection in NLI

Plug-and-Play Clarifier: A Zero-Shot Multimodal Framework for Egocentric Intent Disambiguation

Causally-Grounded Dual-Path Attention Intervention for Object Hallucination Mitigation in LVLMs

Taming Object Hallucinations with Verified Atomic Confidence Estimation
