Advancements in Vision-Language Models
The field of vision-language models is advancing rapidly, with a focus on improving models' ability to understand and generate text and images. Recent work has introduced new architectures and techniques, such as fine-tuning pre-trained models for specific tasks and using multimodal contrastive learning to improve representation learning. These advances have yielded notable gains on a range of tasks, including image-text retrieval and few-shot action recognition. Innovative approaches to visual token pruning and tokenizer flexibility have also been proposed, enabling more efficient and effective use of visual information. Overall, the field is moving toward more sophisticated and generalizable models that capture the complex relationships between text and images. Noteworthy papers include Task-Adapter++, which achieved state-of-the-art performance on 5 few-shot action recognition benchmarks with its dual adaptation method; PRIOR, which introduced a simple vision-language pre-training approach that prioritizes image-related tokens and delivers significant improvements on several vision-language benchmarks; and MMRL++, which proposed a parameter-efficient, interaction-aware representation learning method that strikes a strong balance between task-specific adaptation and generalization across 15 datasets.
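To make the multimodal contrastive learning mentioned above concrete, the sketch below shows a generic CLIP-style symmetric InfoNCE objective over paired image and text embeddings. It is a minimal illustration under assumed conventions (the function name contrastive_loss, the 0.07 temperature, and the PyTorch setup are illustrative choices), not the training recipe of any paper listed under Sources.

```python
# Minimal sketch of CLIP-style multimodal contrastive learning.
# Assumes two encoders have already produced (batch, dim) embeddings
# for matched image/text pairs sharing the same row index.
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss: matching pairs are positives, all other
    rows in the batch act as negatives."""
    # L2-normalize so dot products become cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix, scaled by the temperature.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image->text and text->image).
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

In this formulation the same batched similarity matrix serves both retrieval directions, which is why image-text retrieval benchmarks are a natural fit for models pre-trained this way.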
Sources
Unsupervised Multiview Contrastive Language-Image Joint Learning with Pseudo-Labeled Prompts Via Vision-Language Model for 3D/4D Facial Expression Recognition
Achieving Tokenizer Flexibility in Language Models through Heuristic Adaptation and Supertoken Learning
MMRL++: Parameter-Efficient and Interaction-Aware Representation Learning for Vision-Language Models
Why 1 + 1 < 1 in Visual Token Pruning: Beyond Naive Integration via Multi-Objective Balanced Covering