Advances in Medical Imaging and Vision-Language Models
The field of medical imaging and vision-language models is evolving rapidly, with a focus on more accurate and efficient models for tasks such as image segmentation, disease diagnosis, and report generation. Recent work emphasizes domain-specific foundation models, which provide a unified framework for multiple clinical tasks and improve downstream performance. There is also growing interest in vision-language pretraining, which enables representation learning from large-scale image-text pairs without relying on expensive manual annotations. Notable papers include Mammo-FM, a breast-specific foundational model for integrated mammographic diagnosis, prognosis, and reporting, and STAMP, which resolves the trilemma in MLLM-based segmentation through simultaneous textual mask prediction. Also noteworthy are MedGrounder, which achieves strong zero-shot transfer and outperforms REC-style and grounded report generation baselines, and Panel2Patch, which enables granularity-aware pretraining and achieves substantially better performance with less pretraining data.
Sources
Mammo-FM: Breast-specific foundational model for Integrated Mammographic Diagnosis, Prognosis, and Reporting
Better, Stronger, Faster: Tackling the Trilemma in MLLM-based Segmentation with Simultaneous Textual Mask Prediction
Making Dialogue Grounding Data Rich: A Three-Tier Data Synthesis Framework for Generalized Referring Expression Comprehension
Multi-Aspect Knowledge-Enhanced Medical Vision-Language Pretraining with Multi-Agent Data Generation
Cross-Stain Contrastive Learning for Paired Immunohistochemistry and Histopathology Slide Representation Learning