Advances in Vision-Language Models

The field of vision-language models is advancing rapidly, with a focus on improving the alignment between visual and textual representations. Recent studies explore several routes to making these models more robust and effective, including compositional awareness, better capture of visual detail, and more efficient text encoders. New training methods, such as object-centric self-improving preference optimization and iterative self-improvement, have produced measurable gains on tasks like text-to-image generation and image scoring. Researchers have also examined how samples and targets interact during training, proposing unified loss frameworks to assess their impact on training efficiency. A brief sketch of the contrastive alignment objective that underlies much of this work follows the list below. Noteworthy papers include:

  • Advancing Compositional Awareness in CLIP with Efficient Fine-Tuning, which introduces a novel fine-tuning method to improve compositionality in vision-language models.
  • un$^2$CLIP: Improving CLIP's Visual Detail Capturing Ability via Inverting unCLIP, which proposes a method to enhance the visual detail capturing ability of CLIP models.
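As context for the alignment-focused papers above, the sketch below shows the standard symmetric image-text contrastive (InfoNCE) objective used by CLIP-style models. It is a minimal illustration rather than the method of any listed paper; the embedding dimension, batch size, temperature value, and function names are illustrative assumptions.

```python
# Minimal sketch of CLIP-style image-text contrastive alignment.
# Not taken from any of the papers listed below; shapes and names are assumptions.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric image-text contrastive loss over a batch of paired embeddings."""
    # L2-normalize so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix; entry (i, j) compares image i with text j.
    logits = image_emb @ text_emb.t() / temperature

    # Matched image-text pairs sit on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image-to-text and text-to-image).
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)

if __name__ == "__main__":
    imgs = torch.randn(8, 512)  # dummy image embeddings
    txts = torch.randn(8, 512)  # dummy text embeddings
    print(clip_contrastive_loss(imgs, txts).item())
```

Fine-tuning approaches such as those surveyed here typically modify this objective (e.g., by reweighting hard negatives or adding compositional constraints) rather than replacing it.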

Sources

  • On the Scaling of Robustness and Effectiveness in Dense Retrieval
  • Advancing Compositional Awareness in CLIP with Efficient Fine-Tuning
  • un$^2$CLIP: Improving CLIP's Visual Detail Capturing Ability via Inverting unCLIP
  • Efficient Text Encoders for Labor Market Analysis
  • Equally Critical: Samples, Targets, and Their Mappings in Datasets
  • Object-centric Self-improving Preference Optimization for Text-to-Image Generation
  • Improve Multi-Modal Embedding Learning via Explicit Hard Negative Gradient Amplifying
  • Cycle Consistency as Reward: Learning Image-Text Alignment without Human Preferences
  • Kernel-based Unsupervised Embedding Alignment for Enhanced Visual Representation in Vision-language Models
  • Iterative Self-Improvement of Vision Language Models for Image Scoring and Self-Explanation
  • FuseLIP: Multimodal Embeddings via Early Fusion of Discrete Tokens
  • Language-Image Alignment with Fixed Text Encoders
  • Hierarchical Text Classification Using Contrastive Learning Informed Path Guided Hierarchy
  • Selective Matching Losses -- Not All Scores Are Created Equal
  • Skill-Driven Certification Pathways: Measuring Industry Training Impact on Graduate Employability
  • Scaling Laws for Robust Comparison of Open Foundation Language-Vision Models and Datasets
