Advances in Multimodal Learning for Visual Question Answering

The field of multimodal learning is moving towards more efficient and effective methods for visual question answering (VQA). Recent developments leverage the internal representations of multimodal large language models (MLLMs) to guide the search for relevant image regions, an approach that has led to performance improvements across a range of fine-grained VQA datasets and MLLMs. Noteworthy papers include FOCUS, which proposes a training-free visual cropping method that leverages MLLM-internal representations, and Visual Structures Helps Visual Reasoning, which introduces a simple yet effective intervention that augments visual inputs with low-level spatial structures. Other notable works target hallucination: ReCo proposes a lightweight module for mitigating hallucinations in VLMs, and ONLY presents a one-layer intervention approach for large vision-language models (LVLMs).
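
To make the idea of internal-representation-guided cropping concrete, the minimal sketch below shows how per-patch relevance scores (for example, attention from an MLLM's answer tokens averaged over image patches) could be turned into a crop box around the most relevant region. This is an illustrative assumption, not the FOCUS implementation; the function name and the keep_ratio parameter are hypothetical.

```python
import numpy as np

def crop_box_from_patch_scores(scores, image_size, patch_grid, keep_ratio=0.25):
    """Convert a (gh, gw) map of patch relevance scores into a pixel-space
    bounding box covering the top-scoring patches.

    scores: (gh, gw) array, higher = more relevant (e.g. averaged attention).
    image_size: (width, height) of the original image in pixels.
    patch_grid: (gh, gw) number of patches along each axis.
    keep_ratio: fraction of patches to keep when forming the crop.
    """
    gh, gw = patch_grid
    w, h = image_size
    flat = scores.flatten()
    k = max(1, int(keep_ratio * flat.size))
    top = np.argsort(flat)[-k:]                  # indices of the k most relevant patches
    rows, cols = np.unravel_index(top, (gh, gw))
    # Tight box around the selected patches, mapped back to pixel coordinates.
    x0 = cols.min() * (w / gw)
    y0 = rows.min() * (h / gh)
    x1 = (cols.max() + 1) * (w / gw)
    y1 = (rows.max() + 1) * (h / gh)
    return int(x0), int(y0), int(x1), int(y1)

# Example with a synthetic 16x16 relevance map peaked in the lower-right corner.
rng = np.random.default_rng(0)
scores = rng.random((16, 16))
scores[10:14, 11:15] += 2.0
print(crop_box_from_patch_scores(scores, image_size=(1024, 768), patch_grid=(16, 16)))
```

In a full pipeline of this kind, the cropped region would be re-encoded and fed back to the MLLM together with the question, so the model answers from a higher-resolution view of the relevant area.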
Sources
COOCO -- Common Objects Out-of-Context -- Semantic Violation in Scenes: Investigating Multimodal Context in Referential Communication
CAI: Caption-Sensitive Attention Intervention for Mitigating Object Hallucination in Large Vision-Language Models