Advances in Medical Imaging and Vision-Language Models

The field of medical imaging and vision-language models is rapidly evolving, with a focus on developing more accurate and efficient models for tasks such as image segmentation, disease diagnosis, and report generation. Recent work has emphasized the importance of domain-specific foundation models, which provide a unified framework for multiple clinical tasks and improve performance on downstream applications. There is also growing interest in vision-language pretraining, which enables representation learning from large-scale image-text pairs without relying on expensive manual annotations.

Notable papers in this area include Mammo-FM, which introduces a breast-specific foundational model for integrated mammographic diagnosis, prognosis, and reporting, and STAMP, which tackles the trilemma in MLLM-based segmentation with simultaneous textual mask prediction. Other noteworthy papers include MedGrounder, which achieves strong zero-shot transfer and outperforms referring expression comprehension (REC)-style and grounded report generation baselines, and Panel2Patch, which enables granularity-aware pretraining and achieves substantially better performance with less pretraining data.

Sources

Mammo-FM: Breast-specific foundational model for Integrated Mammographic Diagnosis, Prognosis, and Reporting

Better, Stronger, Faster: Tackling the Trilemma in MLLM-based Segmentation with Simultaneous Textual Mask Prediction

Generalized Medical Phrase Grounding

From Panel to Pixel: Zoom-In Vision-Language Pretraining from Biomedical Scientific Literature

Making Dialogue Grounding Data Rich: A Three-Tier Data Synthesis Framework for Generalized Referring Expression Comprehension

Multi-Aspect Knowledge-Enhanced Medical Vision-Language Pretraining with Multi-Agent Data Generation

Cross-Stain Contrastive Learning for Paired Immunohistochemistry and Histopathology Slide Representation Learning

PULSE: A Unified Multi-Task Architecture for Cardiac Segmentation, Diagnosis, and Few-Shot Cross-Modality Clinical Adaptation

SP-Det: Self-Prompted Dual-Text Fusion for Generalized Multi-Label Lesion Detection
