Vision-Language Models for Out-of-Distribution Generalization and Zero-Shot Learning

The field of vision-language models is moving toward stronger out-of-distribution generalization and zero-shot learning. Recent research focuses on improving the alignment between vision and language embeddings, yielding more accurate and robust representations, and there is growing interest in using large vision-language models as reusable semantic proxies for downstream tasks such as visual document retrieval and image classification. Noteworthy papers include:

- CoDoL, which proposes conditional domain prompt learning to improve out-of-distribution generalization.
- SERVAL, which achieves state-of-the-art zero-shot visual document retrieval with a generate-and-encode pipeline.
- Efficient Long-Tail Learning, which samples synthetic data in the latent space of vision foundation models for long-tail classification.
- Prompt Optimization Meets Subspace Representation Learning, which integrates subspace representation learning with prompt tuning for few-shot out-of-distribution detection.
- No Labels Needed, which combines a vision-language model and a pre-trained visual model in a self-learning cycle for zero-shot image classification.
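The zero-shot paradigm shared by several of these papers can be sketched in a few lines: a vision-language model embeds an image and a set of class-name prompts into the same space, and the class whose text embedding is most cosine-similar to the image embedding wins. The sketch below uses synthetic numpy vectors as stand-ins for real encoder outputs (no actual VLM is loaded); the class names, dimensions, and temperature are illustrative choices, not values from any of the papers. Prompt-learning methods such as CoDoL replace the fixed text prompts with learnable context vectors, but the scoring step is the same.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere so the dot product
    below equals cosine similarity."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def zero_shot_classify(image_emb, text_embs, temperature=0.01):
    """Score one image embedding against per-class text embeddings
    (CLIP-style): cosine similarity, scaled, then softmax."""
    image_emb = l2_normalize(image_emb)
    text_embs = l2_normalize(text_embs)
    logits = image_emb @ text_embs.T / temperature
    logits -= logits.max()                      # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs

# --- toy demo with synthetic embeddings (no real encoder involved) ---
rng = np.random.default_rng(0)
dim = 64
classes = ["cat", "dog", "bird"]
# Stand-in for encoded prompts like "a photo of a {class}".
text_embs = rng.standard_normal((len(classes), dim))
# Fake "dog" image: its embedding lies near the "dog" text embedding.
image_emb = text_embs[1] + 0.1 * rng.standard_normal(dim)

probs = zero_shot_classify(image_emb, text_embs)
pred = classes[int(np.argmax(probs))]           # most similar class
```

No image labels are needed at inference time, which is what makes the class set freely extensible: adding a class is just adding a prompt. Self-learning frameworks like the one in No Labels Needed then use such predictions as pseudo-labels to train a second model.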

Sources

CoDoL: Conditional Domain Prompt Learning for Out-of-Distribution Generalization

SERVAL: Surprisingly Effective Zero-Shot Visual Document Retrieval Powered by Large Vision and Language Models

Efficient Long-Tail Learning in Latent Space by sampling Synthetic Data

Towards Robust Visual Continual Learning with Multi-Prototype Supervision

Prompt Optimization Meets Subspace Representation Learning for Few-shot Out-of-Distribution Detection

Benchmarking Vision-Language and Multimodal Large Language Models in Zero-shot and Few-shot Scenarios: A study on Christian Iconography

No Labels Needed: Zero-Shot Image Classification with Collaborative Self-Learning

SINAI at eRisk@CLEF 2025: Transformer-Based and Conversational Strategies for Depression Detection

MMSE-Calibrated Few-Shot Prompting for Alzheimer's Detection
