Advancements in Multimodal Question Answering

The field of multimodal question answering is moving toward more accurate and reliable models, with an emphasis on mitigating hallucinations and improving performance in real-world applications. Recent work shows that retrieval-augmented generation (RAG) combined with vision-language models (VLMs) handles complex questions effectively, including those that require visual context comprehension and multi-source retrieval. Knowledge graphs and graph-based approaches have also been explored, showing promise in safety-critical settings such as autonomous driving. Noteworthy papers include:

  • SafeDriveRAG proposes a knowledge graph-based RAG approach for visual question answering in autonomous driving scenarios, achieving significant gains on traffic safety tasks.
  • Solution for Meta KDD Cup'25 describes a comprehensive three-step framework for vision question answering that achieved top rankings in the CRAG-MM challenge.
  • Multi-Stage Verification-Centric Framework for Mitigating Hallucination in Multi-Modal RAG presents a robust framework that minimizes hallucinations in multimodal RAG systems, demonstrated in the KDD Cup 2025 challenge; a minimal retrieval-and-verification sketch follows this list.
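
To make the shared recipe concrete, below is a minimal sketch of a knowledge-graph-backed multimodal RAG loop with a verification gate, in the spirit of the systems above. Everything in it (the toy triples, the bag-of-words retriever, the stub VLM call, and the support check) is an illustrative assumption, not code from any of the cited papers.

```python
# Minimal sketch of knowledge-graph-backed multimodal RAG with a verification
# gate. All components here are hypothetical stand-ins for illustration only.
from collections import Counter
import math

# Toy knowledge graph: (subject, relation, object) triples about traffic rules.
KG_TRIPLES = [
    ("red light", "requires", "full stop"),
    ("stop sign", "requires", "full stop"),
    ("green light", "permits", "proceeding"),
]

def embed(text: str) -> Counter:
    """Bag-of-words 'embedding'; a real system would use a learned encoder."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question: str, k: int = 2) -> list[str]:
    """Rank linearized KG triples by similarity to the question."""
    q = embed(question)
    ranked = sorted(
        (" ".join(t) for t in KG_TRIPLES),
        key=lambda fact: cosine(q, embed(fact)),
        reverse=True,
    )
    return ranked[:k]

def toy_vlm(question: str, image_caption: str, facts: list[str]) -> str:
    """Stand-in for a VLM call: reads the answer off the top retrieved fact."""
    return facts[0].split(" requires ")[-1] if facts else "unknown"

def verified_answer(question: str, image_caption: str) -> str:
    """Generate, then keep the answer only if retrieved evidence supports it."""
    facts = retrieve(question)
    answer = toy_vlm(question, image_caption, facts)
    supported = any(answer in fact for fact in facts)
    return answer if supported else "I don't know"  # abstain over hallucinating

print(verified_answer("What does a red light require?", "a red traffic light"))
```

The key design point is the final support check: rather than trusting the generator, the pipeline abstains whenever the retrieved evidence does not contain the produced answer, one simple way to trade recall for fewer hallucinations.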

Sources

  • RoD-TAL: A Benchmark for Answering Questions in Romanian Driving License Exams
  • Multi-Stage Verification-Centric Framework for Mitigating Hallucination in Multi-Modal RAG
  • Solution for Meta KDD Cup'25: A Comprehensive Three-Step Framework for Vision Question Answering
  • SafeDriveRAG: Towards Safe Autonomous Driving with Knowledge Graph-based Retrieval-Augmented Generation
  • A Graph-based Approach for Multi-Modal Question Answering from Flowcharts in Telecom Documents
