Mitigating Hallucinations in Large Vision-Language Models

The field of large vision-language models is moving toward addressing hallucinations, which pose substantial risks in safety-critical AI applications. Researchers are developing evaluation benchmarks and detection methods that target hallucinations in both perception and reasoning. Components such as heuristic question answering and contrastive sentence rating are being used to enrich and calibrate image captions, yielding more accurate and informative descriptions. In parallel, training-free decoding frameworks such as Dynamic Logits Calibration align text generation with visual evidence at inference time, reducing hallucinations and improving the reliability of large vision-language models. Noteworthy papers include ScaleCap, which proposes a scalable debiased captioning strategy for generating comprehensive and detailed image captions, and Mitigating Hallucination of Large Vision-Language Models via Dynamic Logits Calibration, which introduces a training-free decoding framework that dynamically aligns text generation with visual evidence during inference.
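
The idea of contrastive sentence rating for caption calibration can be sketched as follows. This is a minimal illustration, not ScaleCap's actual procedure: it assumes the caller supplies two hypothetical scorer callables, score_with_image and score_without_image, returning sentence log-likelihoods with and without the image in context, and keeps only sentences whose likelihood rises meaningfully when the image is visible.

```python
from typing import Callable, List

def calibrate_caption(
    sentences: List[str],
    score_with_image: Callable[[str], float],    # log-likelihood given image + text
    score_without_image: Callable[[str], float],  # log-likelihood from text alone
    margin: float = 0.0,
) -> List[str]:
    """Keep only sentences that the image, rather than the language prior, supports.

    A sentence whose likelihood barely changes when the image is removed is
    likely driven by the language prior and is treated as a hallucination risk.
    """
    kept = []
    for sent in sentences:
        contrastive_score = score_with_image(sent) - score_without_image(sent)
        if contrastive_score > margin:
            kept.append(sent)
    return kept

# Toy usage with stub scorers; a real setup would query an LVLM twice per sentence.
if __name__ == "__main__":
    caption = ["A dog sits on a red couch.", "A cat sleeps nearby."]
    with_img = {caption[0]: -5.0, caption[1]: -9.0}
    without_img = {caption[0]: -9.5, caption[1]: -9.2}
    print(calibrate_caption(caption,
                            lambda s: with_img[s],
                            lambda s: without_img[s],
                            margin=1.0))
```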

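For the decoding-time direction, the sketch below shows a generic contrastive-style logit adjustment at a single generation step, assumed here as a stand-in for the idea of aligning token probabilities with visual evidence; it is not the Dynamic Logits Calibration algorithm itself, and the names logits_with_image, logits_without_image, and alpha are illustrative.

```python
import numpy as np

def calibrate_logits(
    logits_with_image: np.ndarray,    # (vocab,) logits conditioned on the image
    logits_without_image: np.ndarray,  # (vocab,) logits from the language prior alone
    alpha: float = 1.0,
) -> np.ndarray:
    """Shift next-token logits toward tokens supported by visual evidence.

    Tokens the model prefers only because of its language prior are penalized;
    tokens whose score rises when the image is visible are boosted.
    """
    return logits_with_image + alpha * (logits_with_image - logits_without_image)

def greedy_step(logits_with_image, logits_without_image, alpha=1.0):
    """Pick the next token id from the calibrated logits."""
    calibrated = calibrate_logits(logits_with_image, logits_without_image, alpha)
    return int(np.argmax(calibrated))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vocab = 8
    with_img = rng.normal(size=vocab)
    without_img = rng.normal(size=vocab)
    print("next token id:", greedy_step(with_img, without_img, alpha=1.5))
```
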
Sources

Visual hallucination detection in large vision-language models via evidential conflict

ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing

Seeing is Believing? Mitigating OCR Hallucinations in Multimodal Large Language Models

Mitigating Hallucination of Large Vision-Language Models via Dynamic Logits Calibration

HalluSegBench: Counterfactual Visual Reasoning for Segmentation Hallucination Evaluation
