Advances in Multimodal Learning for Medical Applications

Multimodal learning for medical applications is advancing rapidly. Recent work applies large language models, vision-language models, and multimodal fusion techniques to tasks such as disease diagnosis, image segmentation, and visual question answering (VQA). Specialized models have shown notable gains: Med-GRIM achieved state-of-the-art results on medical VQA, VL-MedGuide demonstrated strong performance on skin disease diagnosis and concept detection, and Doctor Sun extended multimodal biomedical capabilities to bilingual settings. In parallel, new datasets such as Med-GLIP-5M and MM-Food-100K support the training and evaluation of multimodal models in this domain. Overall, the field is moving toward more effective and interpretable multimodal approaches for medical decision support.
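
For readers unfamiliar with the term, "multimodal fusion" here typically means combining embeddings from separate image and text encoders before a task-specific head. Below is a minimal late-fusion sketch in PyTorch; the encoder dimensions, hidden size, and class count are illustrative assumptions, not the architecture of any paper listed under Sources.

```python
# Minimal sketch of late multimodal fusion for a diagnostic classifier.
# All module names and dimensions are illustrative assumptions, not the
# design of any specific paper cited in this digest.
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    def __init__(self, img_dim=512, txt_dim=768, hidden=256, num_classes=5):
        super().__init__()
        # Project each modality's pooled encoder output into a shared space.
        self.img_proj = nn.Linear(img_dim, hidden)
        self.txt_proj = nn.Linear(txt_dim, hidden)
        # Classify from the concatenated (fused) representation.
        self.head = nn.Sequential(
            nn.ReLU(),
            nn.Linear(2 * hidden, num_classes),
        )

    def forward(self, img_emb, txt_emb):
        fused = torch.cat(
            [self.img_proj(img_emb), self.txt_proj(txt_emb)], dim=-1
        )
        return self.head(fused)

# Usage with dummy embeddings standing in for frozen image/text encoders.
model = LateFusionClassifier()
img_emb = torch.randn(4, 512)     # e.g. pooled vision-encoder features
txt_emb = torch.randn(4, 768)     # e.g. pooled clinical-text features
logits = model(img_emb, txt_emb)  # shape: (4, 5)
```

Many of the cited systems go beyond this simple concatenation (e.g., graph-based retrieval in Med-GRIM or attention-based fusion in MANGO), but the projection-then-combine pattern is the common starting point.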

Sources

PEACH: A sentence-aligned Parallel English-Arabic Corpus for Healthcare

SIFThinker: Spatially-Aware Image Focus for Visual Reasoning

CLIPin: A Non-contrastive Plug-in to CLIP for Multimodal Semantic Alignment

Harnessing Adaptive Topology Representations for Zero-Shot Graph Question Answering

Med-GRIM: Enhanced Zero-Shot Medical VQA using prompt-embedded Multimodal Graph RAG

Large Language Models Facilitate Vision Reflection in Image Classification

On the effectiveness of multimodal privileged knowledge distillation in two vision transformer based diagnostic applications

VL-MedGuide: A Visual-Linguistic Large Model for Intelligent and Explainable Skin Disease Auxiliary Diagnosis

BASIC: Boosting Visual Alignment with Intrinsic Refined Embeddings in Multimodal Large Language Models

MV-CoRe: Multimodal Visual-Conceptual Reasoning for Complex Visual Question Answering

MMReID-Bench: Unleashing the Power of MLLMs for Effective and Versatile Person Re-identification

Low-Rank Expert Merging for Multi-Source Domain Adaptation in Person Re-Identification

FLUID: Flow-Latent Unified Integration via Token Distillation for Expert Specialization in Multimodal Learning

MIMIC: Multimodal Inversion for Model Interpretation and Conceptualization

Investigating the Design Space of Visual Grounding in Multimodal Large Language Model

The Medical Metaphors Corpus (MCC)

Information Bottleneck-based Causal Attention for Multi-label Medical Image Recognition

MedReasoner: Reinforcement Learning Drives Reasoning Grounding from Clinical Thought to Pixel-Level Precision

Capabilities of GPT-5 on Multimodal Medical Reasoning

Spatial-ORMLLM: Improve Spatial Relation Understanding in the Operating Room with Multimodal Large Language Model

Doctor Sun: A Bilingual Multimodal Large Language Model for Biomedical AI

AME: Aligned Manifold Entropy for Robust Vision-Language Distillation

MMIF-AMIN: Adaptive Loss-Driven Multi-Scale Invertible Dense Network for Multimodal Medical Image Fusion

GazeLT: Visual attention-guided long-tailed disease classification in chest radiographs

Multi-Contrast Fusion Module: An attention mechanism integrating multi-contrast features for fetal torso plane classification

Performance of GPT-5 Frontier Models in Ophthalmology Question Answering

Interpretable Oracle Bone Script Decipherment through Radical and Pictographic Analysis with LVLMs

MANGO: Multimodal Attention-based Normalizing Flow Approach to Fusion Learning

MM-Food-100K: A 100,000-Sample Multimodal Food Intelligence Dataset with Verifiable Provenance

Med-GLIP: Advancing Medical Language-Image Pre-training with Large-scale Grounded Dataset

Performance of GPT-5 in Brain Tumor MRI Reasoning

Medico 2025: Visual Question Answering for Gastrointestinal Imaging
