Advances in Multimodal Learning for Medical Applications

The field of multimodal learning is advancing rapidly, with particular momentum in medical applications. Recent work applies large language models, vision-language models, and multimodal fusion techniques to tasks such as disease diagnosis, image segmentation, and visual question answering. Specialized models such as Med-GRIM, VL-MedGuide, and Doctor Sun report notable gains on medical visual question answering and image classification, and datasets such as Med-GLIP-5M and MM-Food-100K have facilitated the training and evaluation of multimodal models for medical applications. Overall, the field is moving toward more effective and interpretable multimodal learning approaches for medical decision support. Noteworthy papers include Med-GRIM, which reports state-of-the-art performance on medical VQA tasks, and VL-MedGuide, which demonstrates strong results on skin disease diagnosis and concept detection.
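To make the multimodal fusion idea concrete, the following is a minimal late-fusion sketch in PyTorch: an image embedding and a text embedding are projected into a shared space, concatenated, and passed to a classifier. All class names, dimensions, and data here are illustrative assumptions, not the architecture of any paper listed under Sources.

```python
# Hypothetical sketch of late fusion for a multimodal diagnostic classifier.
# Names, dimensions, and inputs are illustrative assumptions only.
import torch
import torch.nn as nn


class LateFusionClassifier(nn.Module):
    """Fuses precomputed image and text embeddings for a diagnosis label."""

    def __init__(self, image_dim=512, text_dim=768, hidden_dim=256, num_classes=5):
        super().__init__()
        # Project each modality into a shared hidden space.
        self.image_proj = nn.Linear(image_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        # Classify from the concatenated (fused) representation.
        self.classifier = nn.Sequential(
            nn.ReLU(),
            nn.Linear(2 * hidden_dim, num_classes),
        )

    def forward(self, image_emb, text_emb):
        fused = torch.cat(
            [self.image_proj(image_emb), self.text_proj(text_emb)], dim=-1
        )
        return self.classifier(fused)


if __name__ == "__main__":
    model = LateFusionClassifier()
    # Fake batch of 4 samples: image features (e.g. from a vision encoder)
    # and text features (e.g. from a clinical-text encoder).
    image_emb = torch.randn(4, 512)
    text_emb = torch.randn(4, 768)
    logits = model(image_emb, text_emb)
    print(logits.shape)  # torch.Size([4, 5])
```

The papers above generally use far richer fusion (cross-attention, token distillation, expert routing); this sketch only illustrates the basic pattern of combining two modality embeddings before prediction.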
Sources
On the effectiveness of multimodal privileged knowledge distillation in two vision transformer based diagnostic applications
VL-MedGuide: A Visual-Linguistic Large Model for Intelligent and Explainable Skin Disease Auxiliary Diagnosis
BASIC: Boosting Visual Alignment with Intrinsic Refined Embeddings in Multimodal Large Language Models
FLUID: Flow-Latent Unified Integration via Token Distillation for Expert Specialization in Multimodal Learning
MedReasoner: Reinforcement Learning Drives Reasoning Grounding from Clinical Thought to Pixel-Level Precision
Spatial-ORMLLM: Improve Spatial Relation Understanding in the Operating Room with Multimodal Large Language Model
MMIF-AMIN: Adaptive Loss-Driven Multi-Scale Invertible Dense Network for Multimodal Medical Image Fusion