The field of vision-language models is moving toward improved robustness and reliability, particularly with respect to linguistic variations such as paraphrasing and negation. Recent research has focused on new metrics and frameworks for evaluating and enhancing model performance in these areas. Notably, innovative approaches have been proposed to address the challenges of negation and paraphrasing, including subspace modeling and contrastive loss functions. These advances could improve the fairness and equity of vision-language models in socially sensitive contexts.
Some noteworthy papers in this area include:

- PRSM, which introduces a novel measure for quantifying CLIP's sensitivity to paraphrased queries (a hedged sketch of this idea follows the list).
- SpaceVLM, which proposes a training-free framework that models negation as a subspace of the joint embedding space.
- Language-Guided Invariance Probing, which introduces a benchmark measuring invariance to meaning-preserving paraphrases and sensitivity to meaning-changing semantic flips.
- D4C, which proposes a data-free quantization framework tailored to CLIP models.
- Contrastive vision-language learning with paraphrasing and negation, which evaluates the combination of paraphrasing and negation and proposes a new contrastive loss function for CLIP.
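The PRSM-style paraphrase-sensitivity idea lends itself to a compact illustration. The sketch below scores an image against a query and several paraphrases with an off-the-shelf CLIP checkpoint and reports the spread of image-text similarities as a sensitivity proxy. This is a minimal sketch of the general idea, not the paper's metric: the Hugging Face model name, the `paraphrase_sensitivity` helper, and the choice of spread statistics are all assumptions made for illustration.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Off-the-shelf CLIP checkpoint; any CLIP variant with the same API would do.
MODEL_ID = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(MODEL_ID).eval()
processor = CLIPProcessor.from_pretrained(MODEL_ID)

def paraphrase_sensitivity(image, query, paraphrases):
    """Score `image` against `query` and its paraphrases, returning the
    spread of image-text cosine similarities as a sensitivity proxy.
    (Hypothetical helper -- not the PRSM metric from the paper.)"""
    texts = [query] + list(paraphrases)
    inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # logits_per_image = logit_scale * cosine similarity; undo the scale
    sims = out.logits_per_image.squeeze(0) / model.logit_scale.exp()
    base, rest = sims[0], sims[1:]
    return {
        "base_similarity": base.item(),
        "mean_abs_shift": (rest - base).abs().mean().item(),
        "std_over_paraphrases": rest.std().item(),
    }

# Example usage with a PIL image:
# from PIL import Image
# img = Image.open("dog.jpg")
# print(paraphrase_sensitivity(
#     img,
#     "a dog playing in the park",
#     ["a dog playing at the park", "a canine frolicking in a park"],
# ))
```

Under this proxy, a model that is robust to paraphrasing should show a small mean shift and a low standard deviation across paraphrases of the same query.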