Advances in Vision-Language Models

The field of vision-language models is moving toward improved robustness and reliability, particularly with regard to linguistic variations such as paraphrasing and negation. Recent research has focused on new metrics and frameworks for evaluating and improving model behavior under these variations. Notable approaches include subspace modeling of negation and contrastive loss functions designed to handle paraphrased and negated captions. These advances have the potential to improve the fairness and equity of vision-language models in socially sensitive contexts.

Noteworthy papers in this area include PRSM, which introduces a measure for quantifying CLIP's sensitivity to paraphrased queries; SpaceVLM, which proposes a training-free framework that models negation as a subspace of the joint embedding space; Language-Guided Invariance Probing, which introduces a benchmark measuring invariance to meaning-preserving paraphrases and sensitivity to meaning-changing semantic flips; D4C, which proposes a data-free quantization framework tailored to CLIP models; and Contrastive vision-language learning with paraphrasing and negation, which evaluates the combined effect of paraphrasing and negation and proposes a new contrastive loss for CLIP. A sketch of how paraphrase sensitivity can be probed in practice is given below.
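The sketch below illustrates the general idea of measuring paraphrase sensitivity for a CLIP-style model; it is not the official PRSM definition. The checkpoint name, the `paraphrase_sensitivity` helper, and the mean/std summary statistics are assumptions made for illustration, using the Hugging Face `transformers` CLIP API.

```python
# Minimal sketch (assumed, not the PRSM paper's exact metric): measure how much a
# CLIP-style model's image-text similarity fluctuates across paraphrases of one query.
import torch
from transformers import CLIPModel, CLIPProcessor

MODEL_NAME = "openai/clip-vit-base-patch32"  # assumed checkpoint for illustration
model = CLIPModel.from_pretrained(MODEL_NAME)
processor = CLIPProcessor.from_pretrained(MODEL_NAME)

def paraphrase_sensitivity(image, paraphrases):
    """Return (mean, std) of image-text cosine similarity over a set of paraphrases.

    A low std suggests the model scores meaning-preserving rewordings consistently;
    a high std suggests sensitivity to surface-level phrasing.
    """
    inputs = processor(text=paraphrases, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # Normalize embeddings so the dot product is a cosine similarity.
    image_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    text_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    sims = (image_emb @ text_emb.T).squeeze(0)  # one similarity per paraphrase
    return sims.mean().item(), sims.std().item()
```

In use, one would pass a PIL image and a list of paraphrased captions (e.g. "a dog on a couch", "a dog lying on a sofa") and compare the spread of scores across models or checkpoints; the same harness could be extended with negated captions to probe the negation failures that SpaceVLM and the contrastive-loss work target.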

Sources

PRSM: A Measure to Evaluate CLIP's Robustness Against Paraphrases

SpaceVLM: Sub-Space Modeling of Negation in Vision-Language Models

Language-Guided Invariance Probing of Vision-Language Models

D4C: Data-free Quantization for Contrastive Language-Image Pre-training Models

Contrastive vision-language learning with paraphrasing and negation
