Advances in Multimodal Learning for Visual Question Answering

The field of multimodal learning is moving towards more efficient and effective methods for visual question answering (VQA). Recent developments leverage the internal representations of multimodal large language models (MLLMs) to guide the search for relevant image regions, an approach that has led to performance improvements across a range of fine-grained VQA datasets and MLLMs. Noteworthy papers include FOCUS, which proposes a training-free visual cropping method that leverages MLLM-internal representations, and Visual Structures Helps Visual Reasoning, which introduces a simple yet effective intervention that augments visual inputs with low-level spatial structures. Other notable works target hallucination: ReCo proposes a lightweight module for mitigating hallucinations in VLMs, and ONLY presents a one-layer intervention approach for large vision-language models (LVLMs).
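
To make the idea of internal-representation-guided cropping concrete, the minimal sketch below shows how per-patch relevance scores (for example, attention from an MLLM's answer tokens averaged over image patches) could be turned into a crop box around the most relevant region. This is an illustrative assumption, not the FOCUS implementation; the function name and the keep_ratio parameter are hypothetical.

```python
import numpy as np

def crop_box_from_patch_scores(scores, image_size, patch_grid, keep_ratio=0.25):
    """Convert a (gh, gw) map of patch relevance scores into a pixel-space
    bounding box covering the top-scoring patches.

    scores: (gh, gw) array, higher = more relevant (e.g. averaged attention).
    image_size: (width, height) of the original image in pixels.
    patch_grid: (gh, gw) number of patches along each axis.
    keep_ratio: fraction of patches to keep when forming the crop.
    """
    gh, gw = patch_grid
    w, h = image_size
    flat = scores.flatten()
    k = max(1, int(keep_ratio * flat.size))
    top = np.argsort(flat)[-k:]                  # indices of the k most relevant patches
    rows, cols = np.unravel_index(top, (gh, gw))
    # Tight box around the selected patches, mapped back to pixel coordinates.
    x0 = cols.min() * (w / gw)
    y0 = rows.min() * (h / gh)
    x1 = (cols.max() + 1) * (w / gw)
    y1 = (rows.max() + 1) * (h / gh)
    return int(x0), int(y0), int(x1), int(y1)

# Example with a synthetic 16x16 relevance map peaked in the lower-right corner.
rng = np.random.default_rng(0)
scores = rng.random((16, 16))
scores[10:14, 11:15] += 2.0
print(crop_box_from_patch_scores(scores, image_size=(1024, 768), patch_grid=(16, 16)))
```

In a full pipeline of this kind, the cropped region would be re-encoded and fed back to the MLLM together with the question, so the model answers from a higher-resolution view of the relevant area.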
Sources
COOCO -- Common Objects Out-of-Context -- Semantic Violation in Scenes: Investigating Multimodal Context in Referential Communication
CAI: Caption-Sensitive Attention Intervention for Mitigating Object Hallucination in Large Vision-Language Models