The field of multimodal question answering is moving toward more accurate and reliable models, with particular attention to mitigating hallucinations and improving performance in real-world applications. Recent work shows that retrieval-augmented generation (RAG) paired with vision-language models (VLMs) can handle complex questions that demand visual context comprehension and multi-source retrieval. Knowledge graphs and other graph-based approaches have also been explored, showing promise in safety-critical settings such as autonomous driving. Noteworthy papers include:
- SafeDriveRAG, which proposes a knowledge-graph-based RAG approach for visual question answering in autonomous driving scenarios, reporting significant performance gains on traffic-safety tasks.
- Solution for Meta KDD Cup'25, which describes a comprehensive three-step framework for vision question answering, achieving top rankings in the CRAG-MM challenge.
- Multi-Stage Verification-Centric Framework for Mitigating Hallucination in Multi-Modal RAG, which layers multiple verification stages over a multimodal RAG pipeline to minimize hallucinations, demonstrating its effectiveness in the KDD Cup 2025 challenge (a minimal sketch of this retrieve-then-verify pattern follows the list).
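A common thread across these systems is a retrieve-generate-verify loop: gather evidence from multiple sources, condition the answer on that evidence, and abstain when the answer is unsupported. The sketch below illustrates only that general pattern, not any paper's actual pipeline; the corpus, the keyword retriever, the generator stub, and the lexical verifier are all toy stand-ins (real systems would use vector stores, VLM generation, and model-based judges), and every name in it is hypothetical.

```python
"""Minimal sketch of a verification-centric multimodal RAG loop.

Illustrative toy only; not the pipeline from any of the papers above.
All function and variable names are invented for this example.
"""

from dataclasses import dataclass


@dataclass
class Evidence:
    source: str   # e.g. "knowledge_graph", "web", "image_index"
    text: str
    score: float


# Toy in-memory corpus standing in for real multi-source retrieval backends.
CORPUS = [
    Evidence("knowledge_graph", "A red traffic light requires vehicles to stop.", 0.0),
    Evidence("web", "Flashing yellow lights mean proceed with caution.", 0.0),
]


def retrieve(question: str, k: int = 2) -> list[Evidence]:
    """Keyword-overlap retrieval; a real system would use dense vectors."""
    terms = set(question.lower().split())
    for ev in CORPUS:
        ev.score = float(len(terms & set(ev.text.lower().split())))
    return sorted(CORPUS, key=lambda e: e.score, reverse=True)[:k]


def generate(question: str, evidence: list[Evidence]) -> str:
    """Stand-in for a VLM call conditioned on the image and evidence."""
    return evidence[0].text if evidence else "I don't know."


def verify(answer: str, evidence: list[Evidence]) -> bool:
    """Abstain unless the answer is directly supported by retrieved text.

    Verification-centric frameworks use far stronger checks (consistency
    across stages, model-based judges); this toy version only requires
    lexical support.
    """
    return any(answer in ev.text or ev.text in answer for ev in evidence)


def answer(question: str) -> str:
    evidence = retrieve(question)
    draft = generate(question, evidence)
    # Returning "I don't know" on failed verification trades recall for
    # fewer hallucinations, the behavior rewarded in CRAG-MM-style scoring.
    return draft if verify(draft, evidence) else "I don't know."


if __name__ == "__main__":
    print(answer("What does a red traffic light mean?"))
```

The design choice worth noting is the abstention path: because benchmarks like CRAG-MM penalize hallucinated answers more than missing ones, gating the draft answer behind a verifier and falling back to "I don't know" can raise the overall score even though it lowers raw answer coverage.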