Advancements in Vision-Language Models

The field of vision-language models is advancing rapidly, with a focus on improving models' ability to understand and generate both text and images. Recent work introduces new architectures and techniques, such as fine-tuning pre-trained models for specific tasks and using multimodal contrastive learning to strengthen representation learning. These developments have produced significant gains on tasks including image-text retrieval and few-shot action recognition. Innovative approaches to visual token pruning and tokenizer flexibility have also been proposed, enabling more efficient and effective use of visual information. Overall, the field is moving toward more capable, generalizable models that capture the complex relationships between text and images.

Noteworthy papers include: Task-Adapter++, which achieves state-of-the-art performance on five few-shot action recognition benchmarks through its dual adaptation method; PRIOR, a simple vision-language pre-training approach that prioritizes image-related tokens and yields significant improvements on several vision-language benchmarks; and MMRL++, a parameter-efficient, interaction-aware representation learning method that strikes a strong balance between task-specific adaptation and generalization across 15 datasets.
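As a concrete illustration of the multimodal contrastive learning mentioned above, here is a minimal PyTorch sketch of the CLIP-style symmetric InfoNCE objective. The function name and the temperature default are illustrative assumptions, not drawn from any of the papers listed below.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss for a batch of matched (image, text) pairs."""
    # L2-normalize so the dot product is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # (batch, batch) similarity logits; the diagonal holds the true pairs.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Each image should match its own caption, and vice versa.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

Minimizing this loss pulls each image embedding toward its paired caption and pushes it away from the other captions in the batch, which is what drives the retrieval gains noted above.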
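Visual token pruning, also mentioned above, is usually framed as keeping a small, informative subset of a vision encoder's output tokens. The sketch below is a naive score-based baseline that keeps the top-k tokens by [CLS] attention; the pruning paper listed below argues precisely that such naive criteria fall short and proposes multi-objective balanced covering instead, so treat this only as the baseline being improved upon. The function name is hypothetical.

```python
import torch

def prune_visual_tokens(tokens: torch.Tensor,
                        cls_attention: torch.Tensor,
                        keep_ratio: float = 0.5) -> torch.Tensor:
    """Keep the tokens the [CLS] token attends to most strongly.

    tokens:        (batch, num_tokens, dim) visual tokens from a ViT encoder
    cls_attention: (batch, num_tokens) attention weight from [CLS] to each token
    """
    k = max(1, int(tokens.size(1) * keep_ratio))
    # Indices of the k highest-scoring tokens per example.
    _, idx = cls_attention.topk(k, dim=-1)
    idx, _ = idx.sort(dim=-1)  # restore the original token order
    batch_idx = torch.arange(tokens.size(0), device=tokens.device).unsqueeze(-1)
    return tokens[batch_idx, idx]  # (batch, k, dim)

# Example: prune 196 ViT patch tokens down to a quarter.
tokens = torch.randn(2, 196, 768)
attn = torch.rand(2, 196)
pruned = prune_visual_tokens(tokens, attn, keep_ratio=0.25)  # (2, 49, 768)
```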

Sources

Fine-Tuning Video-Text Contrastive Model for Primate Behavior Retrieval from Unlabeled Raw Videos

Task-Adapter++: Task-specific Adaptation with Order-aware Alignment for Few-shot Action Recognition

Prioritizing Image-Related Tokens Enhances Vision-Language Pre-Training

Unsupervised Multiview Contrastive Language-Image Joint Learning with Pseudo-Labeled Prompts Via Vision-Language Model for 3D/4D Facial Expression Recognition

Achieving Tokenizer Flexibility in Language Models through Heuristic Adaptation and Supertoken Learning

MMRL++: Parameter-Efficient and Interaction-Aware Representation Learning for Vision-Language Models

Why 1 + 1 < 1 in Visual Token Pruning: Beyond Naive Integration via Multi-Objective Balanced Covering

MSCI: Addressing CLIP's Inherent Limitations for Compositional Zero-Shot Learning

The Devil Is in the Word Alignment Details: On Translation-Based Cross-Lingual Transfer for Token Classification Tasks

Multi-Token Prediction Needs Registers

End-to-End Vision Tokenizer Tuning
