Advances in Vision-Language Models for Semantic Segmentation and Zero-Shot Learning

The field of vision-language models is advancing rapidly, with a focus on improving semantic segmentation and zero-shot learning. Recent work highlights the value of decoupling visual and textual modalities to improve performance, along with more effective fine-tuning strategies for adapting models to new tasks and domains. In particular, multi-granularity feature calibration, collaborative harmonization, and conditional prompt synthesis have shown promising gains in robustness and generalization, and new frameworks such as CHARM and CoPS enable more effective cross-modal alignment and adaptation. Together, these advances stand to improve vision-language models across a range of applications. Noteworthy papers include: Decouple before Align, which proposes a prompt tuning (PT) framework built on an intuitive decouple-before-align principle; CHARM, a complementary learning framework designed to implicitly align content across modalities while preserving modality-specific advantages; and CoPS, a framework that synthesizes dynamic prompts conditioned on visual features to improve zero-shot anomaly detection (ZSAD) performance.
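The idea of conditioning prompts on visual features, as described for CoPS above, can be illustrated with a minimal sketch. This is not the actual CoPS implementation; all names, dimensions, and the single linear projection are hypothetical, chosen only to show the general pattern of synthesizing prompt tokens from an image embedding and prepending them to text token embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: visual/text embedding width, prompt and text token counts.
D_VIS, D_TXT, N_PROMPT, N_TEXT = 512, 512, 4, 8

# Hypothetical learned projection mapping a visual feature to prompt tokens.
W = rng.standard_normal((D_VIS, N_PROMPT * D_TXT)) * 0.02

def synthesize_prompts(visual_feat: np.ndarray) -> np.ndarray:
    """Generate prompt tokens conditioned on a visual feature vector."""
    flat = visual_feat @ W                    # shape (N_PROMPT * D_TXT,)
    return flat.reshape(N_PROMPT, D_TXT)     # shape (N_PROMPT, D_TXT)

visual_feat = rng.standard_normal(D_VIS)               # e.g. an image encoder output
text_tokens = rng.standard_normal((N_TEXT, D_TXT))     # frozen text token embeddings

prompts = synthesize_prompts(visual_feat)
# Dynamic prompts are prepended to the text tokens before the text encoder.
conditioned_input = np.concatenate([prompts, text_tokens], axis=0)
print(conditioned_input.shape)  # (12, 512)
```

In a real system the projection would be a trained module and the surrounding encoders would come from a pretrained vision-language model; the sketch only conveys that the prompt tokens vary with the input image rather than being fixed.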

Sources

Decouple before Align: Visual Disentanglement Enhances Prompt Tuning

Multi-Granularity Feature Calibration via VFM for Domain Generalized Semantic Segmentation

CHARM: Collaborative Harmonization across Arbitrary Modalities for Modality-agnostic Semantic Segmentation

Zero Shot Domain Adaptive Semantic Segmentation by Synthetic Data Generation and Progressive Adaptation

CoPS: Conditional Prompt Synthesis for Zero-Shot Anomaly Detection

ANPrompt: Anti-noise Prompt Tuning for Vision-Language Models

Accelerating Conditional Prompt Learning via Masked Image Modeling for Vision-Language Models

Unified modality separation: A vision-language framework for unsupervised domain adaptation

Textual and Visual Guided Task Adaptation for Source-Free Cross-Domain Few-Shot Segmentation

Navigating the Trade-off: A Synthesis of Defensive Strategies for Zero-Shot Adversarial Robustness in Vision-Language Models

Textual Inversion for Efficient Adaptation of Open-Vocabulary Object Detectors Without Forgetting
