The field of multimodal learning is advancing rapidly, with a focus on improving the integration of information across modalities such as vision and language. Recent work highlights the importance of aligning hierarchical features from text and images and embedding them in hyperbolic manifolds, whose geometry is well suited to modeling tree-like structure (a brief illustrative sketch of this idea follows the paper list below). There is also growing interest in applying multimodal learning to medical tasks such as visual question answering and medical report generation. Noteworthy papers in this area include:

- Modality Alignment across Trees on Heterogeneous Hyperbolic Manifolds, which proposes a method for aligning tree-like hierarchical features across the image and text modalities.
- VinDr-CXR-VQA, a large-scale chest X-ray dataset for explainable medical visual question answering with spatial grounding.
- CMI-MTL, a Cross-Mamba Interaction based multi-task learning framework that learns cross-modal feature representations from images and texts.
- A Hierarchical Task Structure-Based Cross-Modal Causal Intervention Framework that tackles three main challenges in medical report generation.
- Multi-Task Learning for Visually Grounded Reasoning in Gastrointestinal VQA, a multi-task framework that integrates three curated datasets for simultaneous visual question answering, explanation generation, and visual grounding.
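To make the hyperbolic-embedding idea concrete, here is a minimal sketch, not taken from any of the papers above, of mapping paired image and text features onto a Poincaré ball and measuring their geodesic distance as an alignment signal. The curvature value, feature dimensions, and the random toy features are assumptions for illustration only.

```python
# Minimal sketch: embed hierarchical image/text features into a Poincare ball
# and align matched pairs by minimizing their hyperbolic distance.
# All shapes, the curvature, and the toy inputs are illustrative assumptions.
import torch

C = 1.0  # assumed ball curvature


def expmap0(v: torch.Tensor, c: float = C, eps: float = 1e-6) -> torch.Tensor:
    """Exponential map at the origin: project tangent vectors onto the Poincare ball."""
    norm = v.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.tanh(c ** 0.5 * norm) * v / (c ** 0.5 * norm)


def mobius_add(x: torch.Tensor, y: torch.Tensor, c: float = C) -> torch.Tensor:
    """Mobius addition on the Poincare ball (used inside the distance)."""
    xy = (x * y).sum(dim=-1, keepdim=True)
    x2 = (x * x).sum(dim=-1, keepdim=True)
    y2 = (y * y).sum(dim=-1, keepdim=True)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + c ** 2 * x2 * y2
    return num / den.clamp_min(1e-6)


def poincare_dist(x: torch.Tensor, y: torch.Tensor, c: float = C) -> torch.Tensor:
    """Geodesic distance between two points on the Poincare ball."""
    diff_norm = mobius_add(-x, y, c).norm(dim=-1)
    arg = (c ** 0.5 * diff_norm).clamp(max=1 - 1e-5)
    return 2 / c ** 0.5 * torch.atanh(arg)


# Toy alignment: paired image/text features (e.g. node embeddings from two
# feature hierarchies) are mapped to the ball; matched pairs should be close.
img_feats = torch.randn(8, 64)        # hypothetical image hierarchy features
txt_feats = torch.randn(8, 64)        # hypothetical text hierarchy features
img_hyp = expmap0(0.1 * img_feats)    # scale down so points stay well inside the ball
txt_hyp = expmap0(0.1 * txt_feats)
alignment_loss = poincare_dist(img_hyp, txt_hyp).mean()
print(f"mean hyperbolic alignment distance: {alignment_loss.item():.4f}")
```

In practice such a distance would be minimized for matched image-text pairs (and possibly maximized for mismatched ones) during training; the papers listed above develop considerably more elaborate alignment schemes on top of this basic geometry.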