Hallucination Mitigation in Multimodal Models

Research on multimodal large language models is increasingly focused on hallucinations, where a model generates text that is not grounded in the provided visual content. Proposed mitigations include layer-wise integration and suppression, diffusion-based counterfactual generation, and token-adaptive preference strategies, all aimed at improving the consistency and factual grounding of model outputs; reported results show reduced hallucination rates and gains on standard benchmarks. Notable papers include LISA, which proposes a layer-wise integration and suppression approach to enhance generation consistency; TARS, which introduces a token-adaptive preference strategy to reduce hallucinations; and ViHallu, a vision-centric hallucination mitigation framework that strengthens visual-semantic alignment through visual variation image generation and visual instruction construction. Together, these approaches point toward more reliable and trustworthy multimodal models.
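To make the preference-based direction concrete, the sketch below shows a generic token-adaptive variant of the standard DPO objective: per-token weights up-weight positions judged more likely to hallucinate before the sequence-level preference margin is formed. This is a minimal illustration of the general idea, not the TARS algorithm; the weighting scheme, function name, and tensor shapes are assumptions for the example.

```python
import torch
import torch.nn.functional as F

def token_adaptive_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                            ref_chosen_logps, ref_rejected_logps,
                            weights_chosen, weights_rejected,
                            beta=0.1):
    """Hedged sketch of a token-adaptive preference (DPO-style) loss.

    All *_logps tensors hold per-token log-probabilities of shape
    [batch, seq_len]; weights_* are per-token weights in [0, 1] that
    emphasize tokens suspected of being ungrounded (how those weights
    are computed is an assumption, not taken from the TARS paper).
    """
    # Weighted sequence-level log-ratios under the policy vs. reference model.
    chosen_margin = (weights_chosen * (policy_chosen_logps - ref_chosen_logps)).sum(-1)
    rejected_margin = (weights_rejected * (policy_rejected_logps - ref_rejected_logps)).sum(-1)

    # Standard DPO logistic loss applied to the token-weighted margins.
    logits = beta * (chosen_margin - rejected_margin)
    return -F.logsigmoid(logits).mean()
```

With uniform weights this reduces to ordinary DPO over summed token log-probabilities; the token-adaptive idea is simply to let grounding-sensitive tokens dominate the preference signal.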

Sources

LISA: A Layer-wise Integration and Suppression Approach for Hallucination Mitigation in Multimodal Large Language Models

Causality-aligned Prompt Learning via Diffusion-based Counterfactual Generation

Sem-DPO: Mitigating Semantic Inconsistency in Preference Optimization for Prompt Engineering

TARS: MinMax Token-Adaptive Preference Strategy for Hallucination Reduction in MLLMs

Enhancing Generalization in Data-free Quantization via Mixup-class Prompting

See Different, Think Better: Visual Variations Mitigating Hallucinations in LVLMs
