Vision-Language Models for Out-of-Distribution Generalization and Zero-Shot Learning

The field of vision-language models is moving toward stronger out-of-distribution generalization and zero-shot learning. Recent research focuses on methods that enhance the alignment between vision and language embeddings, yielding more accurate and robust representations. There is also growing interest in leveraging large vision-language models as a reusable semantic proxy for downstream tasks such as visual document retrieval and image classification; a generic zero-shot classification loop is sketched after the paper list below.

Noteworthy papers:
CoDoL proposes a conditional domain prompt learning method to improve out-of-distribution generalization.
SERVAL achieves state-of-the-art results in zero-shot visual document retrieval using a generate-and-encode pipeline.
Efficient Long-Tail Learning leverages the latent space of vision foundation models to generate synthetic data for long-tail classification.
Prompt Optimization Meets Subspace Representation Learning integrates subspace representation learning with prompt tuning for few-shot out-of-distribution detection (a minimal prompt-tuning recipe is also sketched below).
No Labels Needed proposes a zero-shot image classification framework that couples a vision-language model with a pre-trained visual model in a self-learning cycle.
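For context on the zero-shot setting these papers target, the snippet below shows the standard CLIP-style zero-shot classification loop: class names are turned into text prompts, both modalities are embedded in the shared space, and the image is assigned to the most similar prompt. This is a generic baseline rather than the pipeline of any listed paper; the model checkpoint, image path, and class names are placeholders.

```python
# Generic CLIP-style zero-shot classification; checkpoint, image, and classes are placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")                      # placeholder input image
class_names = ["dog", "cat", "car"]                    # placeholder label set
prompts = [f"a photo of a {c}" for c in class_names]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax gives per-class probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(class_names, probs[0].tolist())))
```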
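Several of the entries above (CoDoL, Prompt Optimization Meets Subspace Representation Learning) build on prompt tuning, where a small set of learnable context vectors is prepended to frozen class-name embeddings and optimized against frozen encoders. The sketch below illustrates only that generic recipe; the encoders are toy stand-ins, and none of the domain-conditioning or subspace machinery from the papers is reproduced.

```python
# Minimal prompt-tuning sketch: only the context vectors are trained.
# The GRU/linear encoders are toy stand-ins for frozen CLIP text/image encoders.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
dim, n_ctx, n_classes = 64, 4, 5

text_encoder = nn.GRU(dim, dim, batch_first=True)        # frozen stand-in text encoder
image_encoder = nn.Linear(128, dim)                       # frozen stand-in image encoder
for p in list(text_encoder.parameters()) + list(image_encoder.parameters()):
    p.requires_grad_(False)

class_tokens = torch.randn(n_classes, 1, dim)             # frozen class-name embeddings (1 token each)
context = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)    # learnable shared context vectors
optimizer = torch.optim.Adam([context], lr=1e-2)

def class_text_features():
    # Prepend the learnable context to every class token and encode the prompt.
    prompts = torch.cat([context.unsqueeze(0).expand(n_classes, -1, -1), class_tokens], dim=1)
    _, h = text_encoder(prompts)                           # h: (1, n_classes, dim)
    return F.normalize(h.squeeze(0), dim=-1)

images = torch.randn(16, 128)                              # placeholder few-shot image features
labels = torch.randint(0, n_classes, (16,))                # placeholder labels

for step in range(20):
    img_feat = F.normalize(image_encoder(images), dim=-1)
    logits = 100.0 * img_feat @ class_text_features().t()  # scaled cosine similarity
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```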
Sources
SERVAL: Surprisingly Effective Zero-Shot Visual Document Retrieval Powered by Large Vision and Language Models
Prompt Optimization Meets Subspace Representation Learning for Few-shot Out-of-Distribution Detection