The field of multimodal large language models (MLLMs) is moving toward addressing the persistent challenge of hallucinations, which occur when a model misperceives its visual input or generates content unsupported by it. Recent research has focused on mitigation methods including visual-semantic attention potential fields, preference-optimization frameworks, and gradient-based self-reflection. These approaches aim to improve visual perception and generation fidelity and to reduce over-reliance on textual priors. Notable papers in this area include:

- Two Causes, Not One: Rethinking Omission and Fabrication Hallucinations in MLLMs, which proposes a framework that mitigates hallucinations by showing that omission and fabrication hallucinations have distinct causes.
- OmniDPO: A Preference Optimization Framework to Address Omni-Modal Hallucination, which introduces a preference-alignment framework for mitigating hallucinations in omni-modal large language models.
- Fusion to Enhance: Fusion Visual Encoder to Enhance Multimodal Language Model, which proposes a vision-tower framework that strengthens visual perception.
- No More Sibling Rivalry: Debiasing Human-Object Interaction Detection, which identifies and addresses toxic sibling bias in human-object interaction detection.
- Unveiling the Response of Large Vision-Language Models to Visually Absent Tokens, which discovers a subset of neurons that signal when a token lacks visual grounding and uses them to refine model outputs.
- Mitigating Multimodal Hallucinations via Gradient-based Self-Reflection, which estimates the influence of different token types via gradients to suppress hallucinations.
- Detecting Regional Spurious Correlations in Vision Transformers via Token Discarding, which presents a token-discarding method for detecting spurious correlations in vision transformers.
- Context-Aware Multi-Turn Visual-Textual Reasoning in LVLMs via Dynamic Memory and Adaptive Visual Guidance, which equips large vision-language models with robust, coherent multi-turn visual-textual reasoning.
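Preference-alignment frameworks such as OmniDPO typically build on a DPO-style objective: the policy is pushed to prefer the hallucination-free response over the hallucinated one, relative to a frozen reference model. A minimal sketch of the standard DPO loss for a single preference pair (generic DPO, not OmniDPO's specific omni-modal formulation):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a response under the
    trainable policy or the frozen reference model. `beta` scales the
    implicit reward; values around 0.1 are common in DPO training.
    """
    # Implicit rewards: how much more (log-)likely each response is
    # under the policy than under the reference model.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)): loss shrinks as the policy favors the
    # chosen (e.g. hallucination-free) response by a wider margin.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference exactly, the margin is zero and the loss is log 2; it decreases as the policy learns to separate grounded from hallucinated responses.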