The field of medical imaging is advancing rapidly through the integration of multimodal large language models (MLLMs) and vision-language pretraining. By leveraging text descriptions and radiology reports, these methods improve the understanding and analysis of medical images, including 3D volumes and panoramic X-rays. Notably, new pretraining frameworks and benchmarks are raising the performance of medical AI systems, enabling more accurate and scalable image interpretation.
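At the core of most of these vision-language pretraining pipelines is a contrastive objective that pulls paired image and report embeddings together. The following is a minimal sketch of a CLIP-style symmetric contrastive loss, assuming generic image and text encoders that each output a (batch, dim) embedding tensor; the function and variable names are illustrative, not taken from any of the papers below.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/report embeddings.

    image_emb, text_emb: (batch, dim) tensors from arbitrary encoders.
    Matching image-report pairs share the same batch index.
    """
    # L2-normalize so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    logits = image_emb @ text_emb.t() / temperature       # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image->report and report->image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

Each off-diagonal entry of the logits matrix acts as an in-batch negative, which is what lets such methods scale with paired image-report data and no human annotation.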
Particularly noteworthy papers include:

- Med3DInsight, which introduces a new paradigm for scalable multimodal 3D medical representation learning without requiring human annotations.
- MMOral, the first large-scale multimodal instruction dataset and benchmark tailored for panoramic X-ray interpretation, advancing intelligent dentistry.
- GLAM, a geometry-guided local alignment method for multi-view vision-language pretraining in mammography, which improves performance on downstream tasks.
- An approach that uses LLMs to facilitate large-scale supervised pretraining, making vision-language alignment more performant and scalable and achieving state-of-the-art performance.
- Report2CT, a report-conditional latent diffusion framework that synthesizes 3D chest CT volumes directly from free-text radiology reports, producing clinically faithful, high-quality synthetic data (see the sketch below).
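To make the Report2CT-style generation process concrete, here is a minimal sketch of generic DDPM-style ancestral sampling of a CT latent conditioned on a report embedding. It is a simplified illustration of report-conditional latent diffusion under stated assumptions, not the Report2CT implementation; `denoiser`, `report_emb`, and the schedule parameters are all hypothetical.

```python
import torch

@torch.no_grad()
def sample_ct_latent(denoiser, report_emb, latent_shape, n_steps=50):
    """Ancestral DDPM sampling of a 3D CT latent conditioned on a report.

    denoiser: hypothetical network predicting noise eps from (x_t, t, cond).
    report_emb: text embedding of the free-text radiology report.
    """
    betas = torch.linspace(1e-4, 0.02, n_steps)            # linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(latent_shape)                          # start from pure noise
    for t in reversed(range(n_steps)):
        eps = denoiser(x, torch.tensor([t]), report_emb)   # predicted noise
        # DDPM posterior mean of x_{t-1} given x_t and predicted eps.
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps) / torch.sqrt(alphas[t])
        if t > 0:                                          # add noise except at the last step
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x  # decode with a pretrained VAE decoder to obtain the CT volume
```

Sampling in a compressed latent space rather than voxel space is what makes text-to-3D-CT synthesis tractable; the denoised latent is then decoded back to a full volume.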