Advances in Multimodal Learning for Medical Applications

The field of multimodal learning is advancing rapidly, with a focus on better integrating information across modalities such as vision and language. Recent work highlights the value of aligning hierarchical features extracted from text and images and embedding them in hyperbolic manifolds, whose geometry naturally accommodates tree-like structure (a minimal code sketch of this idea follows the paper list below). There is also growing interest in applying multimodal learning to medical tasks such as visual question answering and medical report generation.

Noteworthy papers in this area include:

Modality Alignment across Trees on Heterogeneous Hyperbolic Manifolds, which aligns tree-like hierarchical features from the image and text modalities on hyperbolic manifolds.

VinDr-CXR-VQA, a large-scale chest X-ray dataset for explainable medical visual question answering with spatial grounding.

CMI-MTL, a Cross-Mamba Interaction based multi-task learning framework that learns cross-modal feature representations from images and texts.

Medical Report Generation: A Hierarchical Task Structure-Based Cross-Modal Causal Intervention Framework, which tackles three main challenges in medical report generation through a hierarchical task structure and cross-modal causal intervention.

Multi-Task Learning for Visually Grounded Reasoning in Gastrointestinal VQA, a multi-task framework that integrates three curated datasets for simultaneous visual question answering, explanation generation, and visual grounding.
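To make the hyperbolic-alignment idea concrete, here is a minimal PyTorch sketch of the Poincaré-ball geodesic distance combined with a simple contrastive loss that pulls matched image and text hierarchy nodes together. This is an illustration of the general technique only, not the method of any paper above; the function names, the contrastive formulation, and the temperature value are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def poincare_distance(u, v, eps=1e-6):
    """Geodesic distance on the Poincare ball, a standard model of hyperbolic space."""
    sq_u = torch.clamp((u * u).sum(-1), max=1 - eps)   # keep points strictly inside the unit ball
    sq_v = torch.clamp((v * v).sum(-1), max=1 - eps)
    sq_diff = ((u - v) ** 2).sum(-1)
    x = 1 + 2 * sq_diff / ((1 - sq_u) * (1 - sq_v))
    return torch.acosh(torch.clamp(x, min=1 + eps))     # clamp avoids NaN at the boundary

def hyperbolic_alignment_loss(img_nodes, txt_nodes, temperature=0.1):
    """Contrastive alignment of matched hierarchy nodes from two modalities.

    img_nodes, txt_nodes: (N, d) tensors inside the unit ball, where row i of
    each tensor is assumed to describe the same concept in the two hierarchies.
    """
    d = poincare_distance(img_nodes.unsqueeze(1), txt_nodes.unsqueeze(0))  # (N, N) pairwise distances
    logits = -d / temperature                  # smaller distance -> larger similarity score
    targets = torch.arange(img_nodes.size(0))  # matched pairs lie on the diagonal
    return F.cross_entropy(logits, targets)
```

In practice, node features from each modality's hierarchy would first be mapped into the unit ball (for example via an exponential map at the origin) before a loss of this kind is applied.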

Sources

Modality Alignment across Trees on Heterogeneous Hyperbolic Manifolds

VinDr-CXR-VQA: A Visual Question Answering Dataset for Explainable Chest X-Ray Analysis with Multi-Task Learning

CMI-MTL: Cross-Mamba interaction based multi-task learning for medical visual question answering

Medical Report Generation: A Hierarchical Task Structure-Based Cross-Modal Causal Intervention Framework

Multi-Task Learning for Visually Grounded Reasoning in Gastrointestinal VQA
