Advances in Vision-Language Models for Semantic Segmentation and Zero-Shot Learning
The field of vision-language models is advancing rapidly, with a focus on improving semantic segmentation and zero-shot learning. Recent work highlights the importance of decoupling visual and textual modalities to improve performance, as well as the need for more effective fine-tuning strategies when adapting models to new tasks and domains. Techniques such as multi-granularity feature calibration, collaborative harmonization, and conditional prompt synthesis have shown promising results in improving robustness and generalization, while frameworks such as CHARM and CoPS enable more effective cross-modal alignment and adaptation. Together, these advances stand to improve the performance of vision-language models across a range of applications. Noteworthy papers include: Decouple before Align, which proposes a prompt tuning (PT) framework built on an intuitive decouple-before-align concept; CHARM, a complementary learning framework designed to implicitly align content while preserving modality-specific advantages; and CoPS, a framework that synthesizes dynamic prompts conditioned on visual features to improve zero-shot anomaly detection (ZSAD) performance.
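The conditional-prompt idea behind CoPS can be illustrated with a short sketch: visual features drive a small network that emits prompt tokens, which are encoded on the text side and compared against the image feature to produce a zero-shot anomaly score. The sketch below is a minimal, hypothetical illustration assuming a CLIP-like shared feature space; the module names, dimensions, and the linear stand-in for the text encoder are placeholders, not the paper's actual design.

```python
# Minimal sketch of conditional prompt synthesis (CoPS-style), assuming a
# CLIP-like backbone that yields a global image feature. All module names,
# dimensions, and the "text encoder" stand-in are illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConditionalPromptSynthesizer(nn.Module):
    """Generates prompt token embeddings conditioned on an image feature."""

    def __init__(self, feat_dim=512, prompt_len=4, token_dim=512):
        super().__init__()
        self.prompt_len = prompt_len
        self.token_dim = token_dim
        # Small MLP mapping a visual feature to a sequence of prompt tokens.
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, feat_dim),
            nn.ReLU(),
            nn.Linear(feat_dim, prompt_len * token_dim),
        )

    def forward(self, image_feat):
        # image_feat: (batch, feat_dim) -> prompts: (batch, prompt_len, token_dim)
        prompts = self.mlp(image_feat)
        return prompts.view(-1, self.prompt_len, self.token_dim)


# Placeholder "text encoder": in practice this would be a frozen CLIP text
# encoder consuming the synthesized prompts together with class-name tokens.
text_encoder = nn.Linear(512, 512)

synthesizer = ConditionalPromptSynthesizer()
image_feat = F.normalize(torch.randn(2, 512), dim=-1)  # dummy visual features
class_token = torch.randn(1, 512)                      # dummy "anomaly" class embedding

prompts = synthesizer(image_feat)                      # (2, 4, 512)
# Pool the synthesized prompts with the class token and encode them.
text_input = prompts.mean(dim=1) + class_token
text_feat = F.normalize(text_encoder(text_input), dim=-1)

# Image-text cosine similarity serves as the zero-shot anomaly score.
anomaly_score = (image_feat * text_feat).sum(dim=-1)
print(anomaly_score.shape)  # torch.Size([2])
```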
Sources
CHARM: Collaborative Harmonization across Arbitrary Modalities for Modality-agnostic Semantic Segmentation
Zero Shot Domain Adaptive Semantic Segmentation by Synthetic Data Generation and Progressive Adaptation