Advancements in Vision-Language Models

The field of vision-language models is evolving rapidly, with a focus on improving factuality, caption quality, and data efficiency. Recent work introduces methods for measuring and improving open-vocabulary factuality in long captions, leaderboards for evaluating detailed image captioning, data-efficient visual supervision through human-AI collaboration, and compact models that serve as in-context judges of image-text data quality. These advances stand to benefit applications such as image captioning, open-world object detection, and visual question answering. Noteworthy papers include OVFact, which introduces a method for measuring and improving caption factuality; LOTUS, a leaderboard that evaluates detailed captions across quality, societal bias, and user preferences; OW-CLIP, which takes a data-efficient, human-AI collaborative approach to visual supervision for open-world object detection; and Trust the Model, which presents a compact VLM for in-context judging of image-text data quality.
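
To make the idea of judging image-text data quality concrete, the sketch below scores an image-caption pair with a generic CLIP similarity check and keeps only pairs above a threshold. This is not the method from Trust the Model or HQ-CLIP; it is a minimal illustration assuming the Hugging Face transformers CLIP API, the openai/clip-vit-base-patch32 checkpoint, and an arbitrary 0.25 threshold.

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

# Load a standard CLIP checkpoint (assumed choice, not from the cited papers).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def caption_quality_score(image_path: str, caption: str) -> float:
    """Cosine similarity between image and caption embeddings as a rough quality proxy."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    image_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    text_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    return float((image_emb @ text_emb.T).item())

# Filter a list of (image, caption) pairs; the 0.25 cutoff is purely illustrative.
pairs = [("img_001.jpg", "a red bicycle leaning against a brick wall")]
filtered = [(img, cap) for img, cap in pairs if caption_quality_score(img, cap) > 0.25]
```

In practice, the papers above go beyond raw similarity scoring, for example by prompting a compact VLM to judge pairs in context or by using large VLMs to curate higher-quality captions, but a similarity filter of this kind is a common baseline for image-text data cleaning.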

Sources

OVFact: Measuring and Improving Open-Vocabulary Factuality for Long Caption Models

LOTUS: A Leaderboard for Detailed Image Captioning from Quality to Societal Bias and User Preferences

OW-CLIP: Data-Efficient Visual Supervision for Open-World Object Detection via Human-AI Collaboration

Trust the Model: Compact VLMs as In-Context Judges for Image-Text Data Quality

Snap, Segment, Deploy: A Visual Data and Detection Pipeline for Wearable Industrial Assistants

Describe, Adapt and Combine: Empowering CLIP Encoders for Open-set 3D Object Retrieval

Multilingual JobBERT for Cross-Lingual Job Title Matching

MetaCLIP 2: A Worldwide Scaling Recipe

HQ-CLIP: Leveraging Large Vision-Language Models to Create High-Quality Image-Text Datasets and CLIP Models
