Advances in Vision-Language Models for Medical Diagnosis

Medical diagnosis is shifting toward vision-language models that integrate visual features from medical images with textual descriptors drawn from clinical metadata and radiology reports, yielding diagnostic models that are both more accurate and more clinically relevant. Recent studies demonstrate their potential across applications including breast cancer diagnosis, thyroid disease detection, and pathology retrieval. Multi-modal learning approaches such as contrastive learning and self-supervised learning continue to improve performance, and incorporating expert knowledge and semantic cues has been found to sharpen the discriminative capability of these models. Overall, the field is moving toward more sophisticated, clinically viable vision-language models that leverage both imaging data and contextual patient information. Hedged code sketches of two recurring ideas, contrastive vision-language alignment and expert-concept integration, follow the paper list below.

Noteworthy papers:

Knowledge-Driven Vision-Language Model for Plexus Detection in Hirschsprung's Disease proposes a framework for integrating expert-derived textual concepts into a vision-language model.

CT-CLIP: A Multi-modal Fusion Framework for Robust Apple Leaf Disease Recognition in Complex Environments demonstrates the effectiveness of a multi-branch recognition framework for agricultural disease recognition in complex environments.

Accurate and Scalable Multimodal Pathology Retrieval via Attentive Vision-Language Alignment presents a retrieval framework that unifies fine-grained attentive mosaic representations with global slide-level embeddings aligned through vision-language contrastive learning.
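The vision-language contrastive learning mentioned above typically follows the CLIP recipe: matched image-text pairs are pulled together and mismatched pairs pushed apart with a symmetric InfoNCE loss. The sketch below is a minimal illustration under assumed inputs (precomputed embeddings, toy dimensions), not the implementation of any paper cited here.

# Minimal sketch of CLIP-style vision-language contrastive alignment.
# Assumes precomputed embeddings from some vision and text encoders;
# all shapes and names here are illustrative, not from the cited papers.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(image_emb: torch.Tensor,
                               text_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    Row i of image_emb and text_emb describe the same case
    (e.g., an image and its report), so matches lie on the diagonal.
    """
    # Normalize so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # logits[i, j] compares image i with text j.
    logits = image_emb @ text_emb.t() / temperature

    # The correct "class" for image i is text i (the diagonal).
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average image-to-text and text-to-image cross-entropies.
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

if __name__ == "__main__":
    # Toy batch: 8 image/report pairs with 512-dim embeddings.
    img, txt = torch.randn(8, 512), torch.randn(8, 512)
    print(contrastive_alignment_loss(img, txt).item())

The temperature controls how sharply the softmax concentrates on hard negatives; CLIP-family models usually learn this scale rather than fixing it.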
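One common way expert knowledge and semantic cues enter such models, offered here as a hedged sketch rather than the method of the Hirschsprung's disease paper, is a concept-bottleneck head: the image embedding is scored against text embeddings of expert-written concept descriptions, and a lightweight classifier operates on those scores. The concept example in the comments is hypothetical.

# Hedged sketch of expert-concept integration via a concept-bottleneck head.
# The concept bank would hold text embeddings of expert-written descriptions
# (e.g., "ganglion cells present"); that example is illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConceptBottleneckHead(nn.Module):
    """Classify an image via its similarity to expert concept embeddings.

    concept_emb: (num_concepts, dim) text embeddings of expert-derived
    concept descriptions, assumed precomputed by a text encoder.
    """
    def __init__(self, concept_emb: torch.Tensor, num_classes: int):
        super().__init__()
        # Freeze the concept bank; only the classifier on top is learned.
        self.register_buffer("concepts", F.normalize(concept_emb, dim=-1))
        self.classifier = nn.Linear(concept_emb.size(0), num_classes)

    def forward(self, image_emb: torch.Tensor) -> torch.Tensor:
        image_emb = F.normalize(image_emb, dim=-1)
        # Concept scores: cosine similarity of the image to each concept.
        scores = image_emb @ self.concepts.t()
        return self.classifier(scores)

if __name__ == "__main__":
    # Toy bank of 12 concepts and a batch of 4 images, 512-dim embeddings.
    head = ConceptBottleneckHead(torch.randn(12, 512), num_classes=2)
    print(head(torch.randn(4, 512)).shape)  # torch.Size([4, 2])

Because predictions flow through named concepts, a reader can inspect which expert cues drove a decision, one reason this pattern recurs in clinical vision-language work.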
Sources
CT-CLIP: A Multi-modal Fusion Framework for Robust Apple Leaf Disease Recognition in Complex Environments
Agro-Consensus: Semantic Self-Consistency in Vision-Language Models for Crop Disease Management in Developing Countries