Mitigating Hallucinations in Large Vision-Language Models

The field of large vision-language models is moving toward addressing hallucinations, which pose substantial risks in safety-critical AI applications. Researchers are developing evaluation benchmarks and detection methods that target hallucinations in both perception and reasoning. Components such as heuristic question answering and contrastive sentence rating are being used to enrich and calibrate image captions, yielding more accurate and informative descriptions. In parallel, training-free decoding frameworks such as Dynamic Logits Calibration align text generation with visual evidence at inference time, reducing hallucinations and improving the reliability of large vision-language models. Noteworthy papers include ScaleCap, which proposes a scalable debiased captioning strategy for generating comprehensive and detailed image captions, and Mitigating Hallucination of Large Vision-Language Models via Dynamic Logits Calibration, which introduces a training-free decoding framework that dynamically aligns text generation with visual evidence during inference.
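
The idea of contrastive sentence rating for caption calibration can be sketched as follows. This is a minimal illustration, not ScaleCap's actual procedure: it assumes the caller supplies two hypothetical scorer callables, score_with_image and score_without_image, returning sentence log-likelihoods with and without the image in context, and keeps only sentences whose likelihood rises meaningfully when the image is visible.

```python
from typing import Callable, List

def calibrate_caption(
    sentences: List[str],
    score_with_image: Callable[[str], float],    # log-likelihood given image + text
    score_without_image: Callable[[str], float],  # log-likelihood from text alone
    margin: float = 0.0,
) -> List[str]:
    """Keep only sentences that the image, rather than the language prior, supports.

    A sentence whose likelihood barely changes when the image is removed is
    likely driven by the language prior and is treated as a hallucination risk.
    """
    kept = []
    for sent in sentences:
        contrastive_score = score_with_image(sent) - score_without_image(sent)
        if contrastive_score > margin:
            kept.append(sent)
    return kept

# Toy usage with stub scorers; a real setup would query an LVLM twice per sentence.
if __name__ == "__main__":
    caption = ["A dog sits on a red couch.", "A cat sleeps nearby."]
    with_img = {caption[0]: -5.0, caption[1]: -9.0}
    without_img = {caption[0]: -9.5, caption[1]: -9.2}
    print(calibrate_caption(caption,
                            lambda s: with_img[s],
                            lambda s: without_img[s],
                            margin=1.0))
```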

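For the decoding-time direction, the sketch below shows a generic contrastive-style logit adjustment at a single generation step, assumed here as a stand-in for the idea of aligning token probabilities with visual evidence; it is not the Dynamic Logits Calibration algorithm itself, and the names logits_with_image, logits_without_image, and alpha are illustrative.

```python
import numpy as np

def calibrate_logits(
    logits_with_image: np.ndarray,    # (vocab,) logits conditioned on the image
    logits_without_image: np.ndarray,  # (vocab,) logits from the language prior alone
    alpha: float = 1.0,
) -> np.ndarray:
    """Shift next-token logits toward tokens supported by visual evidence.

    Tokens the model prefers only because of its language prior are penalized;
    tokens whose score rises when the image is visible are boosted.
    """
    return logits_with_image + alpha * (logits_with_image - logits_without_image)

def greedy_step(logits_with_image, logits_without_image, alpha=1.0):
    """Pick the next token id from the calibrated logits."""
    calibrated = calibrate_logits(logits_with_image, logits_without_image, alpha)
    return int(np.argmax(calibrated))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vocab = 8
    with_img = rng.normal(size=vocab)
    without_img = rng.normal(size=vocab)
    print("next token id:", greedy_step(with_img, without_img, alpha=1.5))
```
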
Sources

Visual hallucination detection in large vision-language models via evidential conflict

ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing

Seeing is Believing? Mitigating OCR Hallucinations in Multimodal Large Language Models

Mitigating Hallucination of Large Vision-Language Models via Dynamic Logits Calibration

HalluSegBench: Counterfactual Visual Reasoning for Segmentation Hallucination Evaluation
