Mitigating Hallucinations in Multimodal Large Language Models

Multimodal large language models (MLLMs) are seeing rapid progress on hallucination mitigation, a critical issue for their reliability in practical applications. Researchers are pursuing several complementary directions, including preference learning, attention intervention, and theory-consistent symmetric multimodal preference optimization. These methods aim to align visual and linguistic representations, filter out irrelevant signals, and correct hallucinations by concentrating learning on the specific regions where they occur. Notable papers include CLAIM, a near training-free method that mitigates multilingual object hallucination via cross-lingual attention intervention, and ASCD, an attention-steerable contrastive decoding framework that reduces hallucination at generation time. Overall, the field is moving toward more robust and reliable MLLMs that can identify and correct hallucinations, improving performance across downstream tasks.
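To make the contrastive-decoding direction concrete, below is a minimal sketch of the general idea that ASCD-style methods build on: compare next-token logits computed with the full visual input against logits computed with a degraded visual input, and penalize tokens that stay likely even without visual evidence. All names here (`contrastive_decode`, `alpha`) and the toy logits are illustrative assumptions, not the paper's actual API or algorithm.

```python
# Minimal sketch of visual contrastive decoding for hallucination
# reduction. Function names, the alpha parameter, and the dummy
# inputs are illustrative assumptions, not ASCD's actual interface.
import numpy as np

def contrastive_decode(logits_full: np.ndarray,
                       logits_degraded: np.ndarray,
                       alpha: float = 1.0) -> int:
    """Pick the next token by amplifying the difference between logits
    from the full visual input and logits from a degraded (e.g.,
    blurred or masked) visual input.

    Tokens whose logits stay high even without useful visual evidence
    are likely driven by the language prior rather than the image,
    so the subtraction penalizes them.
    """
    contrastive = (1.0 + alpha) * logits_full - alpha * logits_degraded
    return int(np.argmax(contrastive))

# Toy example: token 2 is favored by the language prior alone (its
# logit stays high with a degraded image), so contrastive decoding
# shifts the choice to token 1, which is grounded in the real image.
logits_full = np.array([0.1, 2.0, 2.2])      # with the real image
logits_degraded = np.array([0.0, 0.5, 2.1])  # with a degraded image

print(contrastive_decode(logits_full, logits_degraded, alpha=1.0))  # -> 1
```

Attention-steerable variants refine this recipe by intervening on the model's attention maps rather than degrading the raw image, but the contrastive combination of two logit distributions is the shared core.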

Sources

CLAIM: Mitigating Multilingual Object Hallucination in Large Vision-Language Models with Cross-Lingual Attention Intervention

Stop learning it all to mitigate visual hallucination, Focus on the hallucination target

Manager: Aggregating Insights from Unimodal Experts in Two-Tower VLMs and MLLMs

Mitigating Hallucination Through Theory-Consistent Symmetric Multimodal Preference Optimization

How Visual Representations Map to Language Feature Space in Multimodal LLMs

Investigating the interaction of linguistic and mathematical reasoning in language models using multilingual number puzzles

Lost in the Mix: Evaluating LLM Understanding of Code-Switched Text

Expectation Confirmation Preference Optimization for Multi-Turn Conversational Recommendation Agent

ASCD: Attention-Steerable Contrastive Decoding for Reducing Hallucination in MLLM

Lost in Variation? Evaluating NLI Performance in Basque and Spanish Geographical Variants

Demystifying the Visual Quality Paradox in Multimodal Large Language Models