The field of medical imaging is witnessing significant developments in the application of Vision-Language Models (VLMs). A major challenge in this area is enabling VLMs to accurately determine the relative positions of anatomical structures and anomalies in medical images. Recent research has shown that state-of-the-art VLMs struggle with this task, relying more on prior anatomical knowledge than on the actual image content. To address this limitation, novel approaches such as visual prompts and knowledge decomposition are being explored. These methods aim to enhance the performance of VLMs in medical imaging by providing structured semantic supervision and by bridging domain knowledge with spatial structure.

Noteworthy papers in this area include Your Other Left, which evaluates the ability of state-of-the-art VLMs to identify relative positions in medical images and introduces the Medical Imaging Relative Positioning benchmark dataset, and Knowledge to Sight, which proposes a framework that introduces structured semantic supervision by decomposing clinical concepts into interpretable visual attributes, achieving performance on par with or better than larger models despite limited training data.
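The failure mode described above (left/right answers driven by anatomical priors rather than pixels) and the visual-prompting remedy can be made concrete with a small sketch. The snippet below is illustrative only and does not reproduce the protocol of either paper: `query_vlm` is a hypothetical placeholder for whatever model is being evaluated, the red-box overlay stands in for a generic visual prompt, and the mirror-image probe is simply one way to check whether a relative-position answer actually tracks the image content.

```python
from PIL import Image, ImageDraw, ImageOps


def query_vlm(image: Image.Image, question: str) -> str:
    """Stand-in for a real VLM call; replace with the model or API of your choice."""
    return "left"  # canned answer so the sketch runs end to end


def add_visual_prompt(image: Image.Image, box: tuple[int, int, int, int]) -> Image.Image:
    """Overlay a red box on the region of interest, one common form of visual prompting."""
    marked = image.convert("RGB")  # convert() returns a copy, so the input stays untouched
    ImageDraw.Draw(marked).rectangle(box, outline="red", width=4)
    return marked


def flip_consistency_check(image: Image.Image, question: str) -> bool:
    """Ask the same left/right question on the original and a mirrored image.

    A model that reads the pixels should swap 'left' and 'right' when the image
    is mirrored; a model leaning on anatomical priors tends to answer the same
    way both times.
    """
    answer_original = query_vlm(image, question).strip().lower()
    answer_mirrored = query_vlm(ImageOps.mirror(image), question).strip().lower()
    return answer_original != answer_mirrored


if __name__ == "__main__":
    # Synthetic grayscale "scan" so the example needs no external files.
    scan = Image.new("L", (512, 512), color=30)
    prompted = add_visual_prompt(scan, box=(320, 180, 400, 260))
    question = "Is the highlighted opacity on the patient's left or right side?"
    print("Answer:", query_vlm(prompted, question))
    print("Flip-sensitive:", flip_consistency_check(prompted, question))
```

With the canned placeholder answer, the flip check returns False, mimicking a prior-driven model whose response does not change when the image is mirrored; swapping in a real VLM turns this into a quick sanity probe of its spatial grounding.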