Hallucination Mitigation in Multimodal Models

Research on multimodal large language models is increasingly focused on hallucinations, where a model generates text that is not grounded in the provided visual content. Proposed mitigations include layer-wise integration and suppression, diffusion-based counterfactual generation, and token-adaptive preference strategies, all aimed at improving the consistency and factual grounding of model outputs; reported results show reduced hallucination rates and gains on standard benchmarks. Notable papers include LISA, which proposes a layer-wise integration and suppression approach to enhance generation consistency; TARS, which introduces a token-adaptive preference strategy to reduce hallucinations; and ViHallu, a vision-centric hallucination mitigation framework that strengthens visual-semantic alignment through visual variation image generation and visual instruction construction. Together, these approaches point toward more reliable and trustworthy multimodal models.
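To make the preference-based direction concrete, the sketch below shows a generic token-adaptive variant of the standard DPO objective: per-token weights up-weight positions judged more likely to hallucinate before the sequence-level preference margin is formed. This is a minimal illustration of the general idea, not the TARS algorithm; the weighting scheme, function name, and tensor shapes are assumptions for the example.

```python
import torch
import torch.nn.functional as F

def token_adaptive_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                            ref_chosen_logps, ref_rejected_logps,
                            weights_chosen, weights_rejected,
                            beta=0.1):
    """Hedged sketch of a token-adaptive preference (DPO-style) loss.

    All *_logps tensors hold per-token log-probabilities of shape
    [batch, seq_len]; weights_* are per-token weights in [0, 1] that
    emphasize tokens suspected of being ungrounded (how those weights
    are computed is an assumption, not taken from the TARS paper).
    """
    # Weighted sequence-level log-ratios under the policy vs. reference model.
    chosen_margin = (weights_chosen * (policy_chosen_logps - ref_chosen_logps)).sum(-1)
    rejected_margin = (weights_rejected * (policy_rejected_logps - ref_rejected_logps)).sum(-1)

    # Standard DPO logistic loss applied to the token-weighted margins.
    logits = beta * (chosen_margin - rejected_margin)
    return -F.logsigmoid(logits).mean()
```

With uniform weights this reduces to ordinary DPO over summed token log-probabilities; the token-adaptive idea is simply to let grounding-sensitive tokens dominate the preference signal.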

Sources

LISA: A Layer-wise Integration and Suppression Approach for Hallucination Mitigation in Multimodal Large Language Models

Causality-aligned Prompt Learning via Diffusion-based Counterfactual Generation

Sem-DPO: Mitigating Semantic Inconsistency in Preference Optimization for Prompt Engineering

TARS: MinMax Token-Adaptive Preference Strategy for Hallucination Reduction in MLLMs

Enhancing Generalization in Data-free Quantization via Mixup-class Prompting

See Different, Think Better: Visual Variations Mitigating Hallucinations in LVLMs
