The field of multimodal large language models (MLLMs) is moving toward more grounded and explicit knowledge representation through the development of multimodal knowledge graphs (MMKGs). These graphs aim to complement the implicit knowledge stored in MLLM parameters and to enable more effective retrieval-augmented generation. Recent research has focused on building more comprehensive and extensible MMKGs that span visual, audio, and textual information.
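As a rough illustration of how such a graph can back retrieval-augmented generation, the sketch below defines a toy concept-centric MMKG whose entities carry per-modality embeddings; a query embedding retrieves the nearest concepts and their neighboring triples, which would then be serialized into the MLLM prompt. The class and method names (`MMKGEntity`, `MMKG`, `retrieve`, `neighborhood`) are illustrative assumptions, not the data model of any of the surveyed systems.

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class MMKGEntity:
    """A concept node carrying aligned embeddings for several modalities (hypothetical schema)."""
    name: str
    embeddings: dict[str, np.ndarray]  # e.g. {"text": vec, "image": vec, "audio": vec}


@dataclass
class MMKG:
    """A toy multimodal knowledge graph: concept entities plus typed relation triples."""
    entities: dict[str, MMKGEntity] = field(default_factory=dict)
    triples: list[tuple[str, str, str]] = field(default_factory=list)  # (head, relation, tail)

    def add_entity(self, entity: MMKGEntity) -> None:
        self.entities[entity.name] = entity

    def add_triple(self, head: str, relation: str, tail: str) -> None:
        self.triples.append((head, relation, tail))

    def retrieve(self, query: np.ndarray, modality: str, k: int = 3) -> list[str]:
        """Return the k entity names whose `modality` embedding is most cosine-similar to the query."""
        def cosine(entity: MMKGEntity) -> float:
            vec = entity.embeddings.get(modality)
            if vec is None:
                return float("-inf")
            return float(query @ vec / (np.linalg.norm(query) * np.linalg.norm(vec) + 1e-9))

        ranked = sorted(self.entities.values(), key=cosine, reverse=True)
        return [entity.name for entity in ranked[:k]]

    def neighborhood(self, entity: str) -> list[tuple[str, str, str]]:
        """Triples touching `entity`; these would be verbalized as grounded context for the MLLM."""
        return [t for t in self.triples if entity in (t[0], t[2])]
```

In a real pipeline the embeddings would come from pretrained per-modality encoders projected into a shared space, and the retrieved subgraph would be verbalized before being appended to the model's prompt.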
Noteworthy papers in this area include VAT-KG, which proposes a concept-centric, knowledge-intensive multimodal knowledge graph, and MANTA, which introduces a theoretically grounded framework for cross-modal semantic alignment and information-theoretic optimization. MOTOR is also notable for its multimodal retrieval and re-ranking approach to medical visual question answering. These papers report substantial accuracy gains, underscoring the potential of MMKGs and MLLMs in applications such as educational textbook question answering, document understanding, and long-form multimodal understanding.
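To make the retrieve-and-re-rank idea concrete, here is a minimal two-stage sketch in the general spirit of such pipelines, not MOTOR's actual method: a cheap cosine-similarity pass over precomputed candidate embeddings produces a shortlist, which a stronger (and more expensive) cross-modal scorer then reorders. The function name and the `rerank_score` callback are hypothetical stand-ins.

```python
from typing import Callable, Sequence

import numpy as np


def retrieve_then_rerank(
    query_emb: np.ndarray,
    candidates: Sequence[dict],                      # each: {"id": str, "emb": np.ndarray, ...}
    rerank_score: Callable[[np.ndarray, dict], float],
    first_stage_k: int = 50,
    final_k: int = 5,
) -> list[str]:
    """Two-stage pipeline: cheap dense retrieval, then a stronger re-ranker on the shortlist."""

    # Stage 1: cosine similarity against precomputed candidate embeddings.
    def cosine(candidate: dict) -> float:
        vec = candidate["emb"]
        return float(query_emb @ vec / (np.linalg.norm(query_emb) * np.linalg.norm(vec) + 1e-9))

    shortlist = sorted(candidates, key=cosine, reverse=True)[:first_stage_k]

    # Stage 2: re-score only the shortlist with a more expensive cross-modal scorer
    # (in practice a cross-encoder or an MLLM judging query-candidate relevance).
    reranked = sorted(shortlist, key=lambda c: rerank_score(query_emb, c), reverse=True)
    return [c["id"] for c in reranked[:final_k]]
```

Splitting retrieval this way keeps the expensive scorer off the full candidate pool while still letting it correct the first stage's mistakes on the shortlist.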