Mitigating Hallucinations in Large Vision-Language Models

The field of Large Vision-Language Models (LVLMs) is moving toward addressing the critical issue of hallucinations, where models generate outputs inconsistent with their visual inputs. Researchers are proposing approaches such as attention-inspired adaptive decoding, ensemble decoding, and object-centric visual tokenization to mitigate hallucinations and improve the robustness and accuracy of LVLMs on tasks such as image captioning and visual question answering.

Several papers make notable contributions. Mixture of Decoding proposes an attention-inspired strategy that dynamically adapts how tokens are decoded in order to suppress hallucinated content. Slot-MLLM introduces an object-centric visual tokenizer that encodes local visual details while preserving high-level semantics, aligning visual tokens with text so they integrate seamlessly into a unified next-token prediction framework. VaLSe proposes a Vision-aware Latent Steering framework that follows an interpretation-then-mitigation strategy to address object hallucinations. ActLCD contributes a decoding strategy that actively decides when to apply contrasting layers during generation. Together, these works illustrate the breadth of current efforts to curb hallucinations in LVLMs.
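
To make the layer-contrastive idea concrete, below is a minimal, illustrative sketch of generic layer-contrastive decoding: the next-token distribution from the final layer is contrasted against that of an earlier layer, boosting tokens whose probability grows with depth. This is a simplified sketch of the general technique, not the ActLCD algorithm itself; the function names, tensor shapes, and hyperparameters are assumptions for illustration.

```python
# Generic layer-contrastive decoding sketch (hypothetical names and shapes).
import torch
import torch.nn.functional as F

def contrastive_next_token_logits(final_hidden, early_hidden, lm_head, alpha=0.5):
    """Contrast the mature (final-layer) distribution against a premature
    (early-layer) one. Inputs are [batch, d_model] hidden states at the last
    position; lm_head projects d_model -> vocab_size."""
    final_logp = F.log_softmax(lm_head(final_hidden), dim=-1)
    early_logp = F.log_softmax(lm_head(early_hidden), dim=-1)
    # Emphasize tokens the final layer prefers over the early layer.
    return (1.0 + alpha) * final_logp - alpha * early_logp

# Toy usage with random tensors standing in for real model activations.
d_model, vocab = 64, 1000
lm_head = torch.nn.Linear(d_model, vocab, bias=False)
final_h = torch.randn(1, d_model)
early_h = torch.randn(1, d_model)
logits = contrastive_next_token_logits(final_h, early_h, lm_head)
next_token = logits.argmax(dim=-1)  # greedy pick from the contrasted distribution
```

An adaptive variant, in the spirit of the decoding papers above, would decide per step whether to apply this contrast (for example, based on attention to visual tokens) rather than using it unconditionally.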

Sources

Mixture of Decoding: An Attention-Inspired Adaptive Decoding Strategy to Mitigate Hallucinations in Large Vision-Language Models

Do You Keep an Eye on What I Ask? Mitigating Multimodal Hallucination via Attention-Guided Ensemble Decoding

Slot-MLLM: Object-Centric Visual Tokenization for Multimodal LLM

Seeing It or Not? Interpretable Vision-aware Latent Steering to Mitigate Object Hallucinations

Image Tokens Matter: Mitigating Hallucination in Discrete Tokenizer-based Large Vision-Language Models via Latent Editing

VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models

Active Layer-Contrastive Decoding Reduces Hallucination in Large Language Model Generation
