Mitigating Hallucinations in Large Vision-Language Models

The field of Large Vision-Language Models (LVLMs) is moving toward addressing the critical issue of hallucinations, where models generate outputs inconsistent with their visual inputs. Researchers are proposing approaches such as attention-inspired adaptive decoding, ensemble decoding, and object-centric visual tokenization to mitigate hallucinations and improve the robustness and accuracy of LVLMs on tasks such as image captioning and visual question answering.

Several papers make notable contributions. Mixture of Decoding proposes an attention-inspired strategy that dynamically adapts how tokens are decoded in order to suppress hallucinated content. Slot-MLLM introduces an object-centric visual tokenizer that encodes local visual details while preserving high-level semantics, aligning visual tokens with text so they integrate seamlessly into a unified next-token prediction framework. VaLSe proposes a Vision-aware Latent Steering framework that follows an interpretation-then-mitigation strategy to address object hallucinations. ActLCD contributes a decoding strategy that actively decides when to apply contrasting layers during generation. Together, these works illustrate the breadth of current efforts to curb hallucinations in LVLMs.
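
To make the layer-contrastive idea concrete, below is a minimal, illustrative sketch of generic layer-contrastive decoding: the next-token distribution from the final layer is contrasted against that of an earlier layer, boosting tokens whose probability grows with depth. This is a simplified sketch of the general technique, not the ActLCD algorithm itself; the function names, tensor shapes, and hyperparameters are assumptions for illustration.

```python
# Generic layer-contrastive decoding sketch (hypothetical names and shapes).
import torch
import torch.nn.functional as F

def contrastive_next_token_logits(final_hidden, early_hidden, lm_head, alpha=0.5):
    """Contrast the mature (final-layer) distribution against a premature
    (early-layer) one. Inputs are [batch, d_model] hidden states at the last
    position; lm_head projects d_model -> vocab_size."""
    final_logp = F.log_softmax(lm_head(final_hidden), dim=-1)
    early_logp = F.log_softmax(lm_head(early_hidden), dim=-1)
    # Emphasize tokens the final layer prefers over the early layer.
    return (1.0 + alpha) * final_logp - alpha * early_logp

# Toy usage with random tensors standing in for real model activations.
d_model, vocab = 64, 1000
lm_head = torch.nn.Linear(d_model, vocab, bias=False)
final_h = torch.randn(1, d_model)
early_h = torch.randn(1, d_model)
logits = contrastive_next_token_logits(final_h, early_h, lm_head)
next_token = logits.argmax(dim=-1)  # greedy pick from the contrasted distribution
```

An adaptive variant, in the spirit of the decoding papers above, would decide per step whether to apply this contrast (for example, based on attention to visual tokens) rather than using it unconditionally.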

Sources

Mixture of Decoding: An Attention-Inspired Adaptive Decoding Strategy to Mitigate Hallucinations in Large Vision-Language Models

Do You Keep an Eye on What I Ask? Mitigating Multimodal Hallucination via Attention-Guided Ensemble Decoding

Slot-MLLM: Object-Centric Visual Tokenization for Multimodal LLM

Seeing It or Not? Interpretable Vision-aware Latent Steering to Mitigate Object Hallucinations

Image Tokens Matter: Mitigating Hallucination in Discrete Tokenizer-based Large Vision-Language Models via Latent Editing

VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models

Active Layer-Contrastive Decoding Reduces Hallucination in Large Language Model Generation
