The field of surgical research is shifting toward vision-language models (VLMs) to improve surgical scene understanding and automation. These models adapt well to diverse visual data and a wide range of downstream tasks, making them attractive for the complex perception challenges of surgical procedures. Recent work has probed the capabilities and limitations of VLMs in the surgical domain, including how reliably they ground language to the correct regions in surgical scenes. Researchers have also proposed VLM-based representation learning methods for surgical workflow analysis that improve surgical phase recognition, and have integrated VLMs with robotic platforms to automate biological experiments and build autonomous tracking systems for endoscopic procedures. Notable papers in this area include:
- Benchmarking performance, explainability, and evaluation strategies of vision-language models for surgery, which highlights key gaps in the models' ability to consistently link language to the correct regions in surgical scenes.
- ReSW-VL: Representation Learning for Surgical Workflow Analysis Using Vision-Language Model, which proposes VLM-based representation learning for surgical workflow analysis and demonstrates improved surgical phase recognition over conventional methods (a minimal baseline for this task is sketched after this list).
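To make the phase-recognition use case concrete, below is a minimal sketch of zero-shot surgical phase classification with an off-the-shelf CLIP-style VLM via Hugging Face `transformers`. This is not the ReSW-VL training procedure itself, only a simple baseline in the spirit of the works above; the phase labels and prompt template are illustrative assumptions (Cholec80-style cholecystectomy phases), and `frame.png` is a placeholder for a single endoscopic video frame.

```python
# Hedged sketch: zero-shot surgical phase recognition with a generic
# CLIP-style VLM. Illustrative only; not the ReSW-VL method, which
# trains the representations further rather than using them zero-shot.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Illustrative phase taxonomy (Cholec80-style); substitute any
# workflow labels appropriate to the target procedure.
phases = [
    "preparation",
    "calot triangle dissection",
    "clipping and cutting",
    "gallbladder dissection",
    "gallbladder packaging",
    "cleaning and coagulation",
    "gallbladder retraction",
]
prompts = [f"a laparoscopic surgery frame showing the {p} phase" for p in phases]

frame = Image.open("frame.png")  # placeholder: one endoscopic video frame
inputs = processor(text=prompts, images=frame, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)
    # logits_per_image holds the frame's similarity to each phase prompt
    probs = outputs.logits_per_image.softmax(dim=-1)

pred = phases[probs.argmax(dim=-1).item()]
print(f"predicted phase: {pred}")
```

Classifying frames independently like this ignores temporal context; methods in this line of work typically add temporal modeling or fine-tune the image encoder on surgical video, which is where the reported gains over conventional approaches come from.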