Multimodal Knowledge Graphs and Large Language Models

The field of multimodal large language models (MLLMs) is moving toward more grounded and explicit knowledge representation through the development of multimodal knowledge graphs (MMKGs). These graphs aim to complement the implicit, parametric knowledge of MLLMs and to supply external evidence for retrieval-augmented generation (RAG). Recent research has focused on building more comprehensive and extensible MMKGs that incorporate multiple modalities, including visual, audio, and text information.
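
To make the retrieval-augmented generation framing concrete, the sketch below shows a toy concept-centric multimodal knowledge graph queried at answer time. It is a minimal illustration under assumed names: `MMNode`, `MMTriple`, the keyword-overlap retriever, and the prompt linearization are hypothetical stand-ins for the learned multimodal encoders and schemas used in works such as VAT-KG, not their actual implementations.

```python
from dataclasses import dataclass


@dataclass
class MMNode:
    """A concept node that can carry evidence from several modalities."""
    concept: str
    text: str = ""
    image_uri: str = ""   # path or URL to an associated image, if any
    audio_uri: str = ""   # path or URL to an associated audio clip, if any


@dataclass
class MMTriple:
    """(head, relation, tail) edge between two multimodal concept nodes."""
    head: MMNode
    relation: str
    tail: MMNode


class MultimodalKG:
    """Toy concept-centric store queried at generation time for RAG."""

    def __init__(self) -> None:
        self.triples: list[MMTriple] = []

    def add(self, triple: MMTriple) -> None:
        self.triples.append(triple)

    def retrieve(self, query: str, k: int = 3) -> list[MMTriple]:
        """Rank triples by naive keyword overlap with the query
        (a placeholder for a learned multimodal retriever)."""
        terms = set(query.lower().split())

        def score(t: MMTriple) -> int:
            text = f"{t.head.concept} {t.relation} {t.tail.concept} {t.head.text} {t.tail.text}"
            return len(terms & set(text.lower().split()))

        return sorted(self.triples, key=score, reverse=True)[:k]


def build_prompt(question: str, evidence: list[MMTriple]) -> str:
    """Linearize retrieved triples (and any modality pointers) into the prompt."""
    lines = []
    for t in evidence:
        line = f"{t.head.concept} --{t.relation}--> {t.tail.concept}"
        if t.head.image_uri or t.tail.image_uri:
            line += f" [image: {t.head.image_uri or t.tail.image_uri}]"
        lines.append(line)
    return "Knowledge:\n" + "\n".join(lines) + f"\n\nQuestion: {question}\nAnswer:"


if __name__ == "__main__":
    kg = MultimodalKG()
    violin = MMNode("violin", "a bowed string instrument", image_uri="img/violin.jpg")
    family = MMNode("string instrument family", "instruments that produce sound from strings")
    kg.add(MMTriple(violin, "is_a", family))
    print(build_prompt("What family does the violin belong to?", kg.retrieve("violin family")))
```

The retrieved triples, together with any attached image or audio pointers, are prepended to the generation prompt so the MLLM can ground its answer in explicit, inspectable knowledge rather than relying on its parameters alone.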

Noteworthy papers in this area include VAT-KG, which proposes a concept-centric, knowledge-intensive multimodal knowledge graph, and MANTA, which introduces a theoretically grounded framework for cross-modal semantic alignment and information-theoretic optimization. MOTOR is also notable for its grounded multimodal retrieval and re-ranking approach to medical visual question answering. These papers report substantial performance gains, highlighting the potential of MMKGs and MLLMs across applications such as educational textbook question answering, document understanding, and long-form multimodal understanding.
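
MOTOR-style systems pair a recall-oriented first-stage retriever with a more precise re-ranking step before answer generation. The sketch below shows only that generic retrieve-then-rerank pattern; the placeholder scoring functions are assumptions for illustration and do not reproduce MOTOR's optimal-transport formulation or any medical retriever.

```python
from typing import Callable, Sequence


def retrieve_then_rerank(
    query: str,
    corpus: Sequence[str],
    first_stage: Callable[[str, str], float],
    reranker: Callable[[str, str], float],
    k_retrieve: int = 20,
    k_final: int = 5,
) -> list[str]:
    """Two-stage pipeline: cheap, recall-oriented retrieval over the whole
    corpus, then a more expensive re-ranking of the shortlist."""
    shortlist = sorted(corpus, key=lambda d: first_stage(query, d), reverse=True)[:k_retrieve]
    return sorted(shortlist, key=lambda d: reranker(query, d), reverse=True)[:k_final]


def keyword_overlap(query: str, doc: str) -> float:
    """Placeholder first-stage scorer (stands in for a dense/multimodal retriever)."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / (len(q) or 1)


def length_penalized_overlap(query: str, doc: str) -> float:
    """Placeholder re-ranker (stands in for a cross-modal relevance model)."""
    return keyword_overlap(query, doc) / (1 + abs(len(doc.split()) - len(query.split())) / 10)


if __name__ == "__main__":
    docs = [
        "chest x-ray showing left lower lobe consolidation",
        "ct scan of the abdomen with contrast",
        "x-ray of the wrist after a fall",
    ]
    print(retrieve_then_rerank("what does the chest x-ray show", docs,
                               keyword_overlap, length_penalized_overlap))
```

In a real system the two scorers would be swapped for learned models of different cost, with the re-ranker seeing richer cross-modal evidence than the first-stage retriever.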

Sources

VAT-KG: Knowledge-Intensive Multimodal Knowledge Graph Dataset for Retrieval-Augmented Generation

Evaluating Multimodal Large Language Models on Educational Textbook Question Answering

Structured Attention Matters to Multimodal LLMs in Document Understanding

MOTOR: Multimodal Optimal Transport via Grounded Retrieval in Medical Visual Question Answering

MANTA: Cross-Modal Semantic Alignment and Information-Theoretic Optimization for Long-form Multimodal Understanding

Chart Question Answering from Real-World Analytical Narratives

Coling-UniA at SciVQA 2025: Few-Shot Example Retrieval and Confidence-Informed Ensembling for Multimodal Large Language Models
