The field of vision-language models is increasingly focused on robustness and generalizability, particularly in out-of-distribution (OOD) scenarios, with researchers exploring new ways to make these models reliable in real-world applications. One notable direction is federated learning methods that adapt to diverse client data distributions while preserving data privacy. Another key focus is the creation of benchmarks and datasets that rigorously evaluate the OOD robustness of vision-language models. Noteworthy papers in this area include:
- FOCoOp, which introduces a framework for enhancing OOD robustness in federated prompt learning for vision-language models.
- Seeing What Matters, which presents a novel forensic-oriented data augmentation strategy for improving the generalizability of AI-generated video detectors.
- LAION-C, which proposes a new benchmark dataset for evaluating OOD robustness in web-scale vision models.
- BrokenVideos, which provides a comprehensive benchmark dataset for fine-grained artifact localization in AI-generated videos.
- pFedDC, which proposes a personalized federated learning framework based on dual-prompt optimization and cross fusion.
- DiMPLe, which introduces a disentangled multi-modal prompt learning approach that separates invariant from spurious features to improve OOD alignment.
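Federated prompt learning approaches such as FOCoOp and pFedDC typically keep the vision-language backbone frozen and communicate only small learnable prompt vectors between clients and the server. A minimal sketch of the server-side aggregation step, assuming a FedAvg-style weighted average of client prompts (the function name, shapes, and weighting scheme here are illustrative, not taken from any of the papers above):

```python
import numpy as np

def fedavg_prompts(client_prompts, client_weights=None):
    """Aggregate learned prompt vectors across clients (FedAvg-style sketch).

    client_prompts: list of arrays, each of shape (n_ctx, dim) -- the
    learnable context tokens each client tunes locally while the
    vision-language backbone stays frozen.
    client_weights: optional per-client weights (e.g. local dataset sizes).
    """
    prompts = np.stack(client_prompts)           # (n_clients, n_ctx, dim)
    if client_weights is None:
        client_weights = np.ones(len(client_prompts))
    w = np.asarray(client_weights, dtype=float)
    w = w / w.sum()                              # normalize to a distribution
    # weighted mean over the client axis: sum_i w[i] * prompts[i]
    return np.tensordot(w, prompts, axes=1)

# Toy round: three clients, each with 4 context tokens of dimension 8.
rng = np.random.default_rng(0)
clients = [rng.normal(size=(4, 8)) for _ in range(3)]
global_prompt = fedavg_prompts(clients, client_weights=[100, 50, 50])
assert global_prompt.shape == (4, 8)
```

Only the prompt vectors cross the network, so the per-round communication cost is tiny compared with exchanging full model weights, which is what makes prompt tuning attractive in the federated setting.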