Medical imaging analysis is advancing rapidly, with active work on methods for image classification, segmentation, and retrieval. Recent research has explored self-supervised learning, multimodal fusion, and vision-language modeling to improve both the accuracy and the efficiency of these tasks, with promising results in applications such as disease detection, tumor segmentation, and image captioning. Notably, the development of large-scale datasets and benchmarks has made it easier to evaluate and compare methods, driving progress in the field.
Several papers stand out in this area. M3Ret presents a unified visual encoder for multimodal medical image retrieval and achieves state-of-the-art zero-shot image-to-image retrieval across diverse modalities. MedVista3D introduces a multi-scale, semantically enriched vision-language pretraining framework for 3D CT analysis, reporting state-of-the-art results in zero-shot disease classification, report retrieval, and medical visual question answering. CLAPS proposes a CLIP-unified auto-prompt segmentation method for multimodal retinal imaging, performing on par with specialized expert models and surpassing existing benchmarks on most metrics.
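The zero-shot retrieval setting these papers evaluate generally works by embedding all images, regardless of modality, into one shared space and ranking gallery items by cosine similarity to a query. The sketch below illustrates that retrieval step only; it is not the papers' actual pipeline, and the embeddings, names, and dimensions are illustrative assumptions.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # scale each embedding to unit length so dot products equal cosine similarity
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def retrieve(query_emb, gallery_embs, top_k=3):
    # cosine similarity between the query and every gallery embedding
    sims = l2_normalize(gallery_embs) @ l2_normalize(query_emb)
    # indices of the top_k most similar gallery items, best first
    return np.argsort(-sims)[:top_k]

rng = np.random.default_rng(0)
# stand-in embeddings for a mixed-modality gallery (e.g. CT, MRI, X-ray slices)
gallery = rng.normal(size=(100, 512))
# query that is a slightly perturbed copy of gallery item 42
query = gallery[42] + 0.05 * rng.normal(size=512)
print(retrieve(query, gallery))
```

In practice the embeddings would come from a trained encoder such as those described above; the key property the papers optimize for is that images of the same finding land close together in this space even when they come from different modalities.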