Advances in Vision-Language Models and Adaptive Learning

The field of vision-language models and adaptive learning is evolving rapidly, with a focus on improving model robustness, efficiency, and generalization. Recent work introduces novel activation functions such as SG-Blend, which learns an interpolation between improved Swish and GELU to yield more robust neural representations, and proxy-based methods such as Proxy-FDA, which mitigate concept forgetting when fine-tuning vision foundation models. Researchers have also explored adaptive model updates under constrained resource budgets: RCCDA adapts training dynamics to concept drift while ensuring strict compliance with predefined resource constraints. Vision-language models have likewise been extended with GeoVision Labeler, which enables zero-shot geospatial classification, and OASIS, an adaptive online sample selection approach for continual visual instruction tuning. Noteworthy papers include SG-Blend, which reports state-of-the-art performance across a range of tasks, and Proxy-FDA, which substantially reduces concept forgetting during fine-tuning.
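To make the SG-Blend idea concrete, the sketch below shows one plausible way to implement a learnable blend between Swish and GELU in PyTorch. This is an assumption-based illustration of the general interpolation concept, not the authors' exact formulation; the parameter names (alpha_logit, beta), the sigmoid gating of the mixing weight, and the convex-combination form are illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlendedActivation(nn.Module):
    """Minimal sketch: a learnable interpolation between Swish and GELU.

    NOTE: illustrative only; SG-Blend's actual parameterization may differ.
    """

    def __init__(self):
        super().__init__()
        # Unconstrained logit; sigmoid keeps the mixing weight in (0, 1).
        self.alpha_logit = nn.Parameter(torch.zeros(1))
        # Learnable slope for the Swish branch: swish(x) = x * sigmoid(beta * x).
        self.beta = nn.Parameter(torch.ones(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        alpha = torch.sigmoid(self.alpha_logit)      # mixing weight in (0, 1)
        swish = x * torch.sigmoid(self.beta * x)     # Swish with learnable slope
        gelu = F.gelu(x)                             # standard GELU
        return alpha * swish + (1.0 - alpha) * gelu  # convex combination of the two

# Usage: drop-in replacement for a fixed activation inside an MLP block.
layer = nn.Sequential(nn.Linear(128, 128), BlendedActivation())
out = layer(torch.randn(4, 128))
```

Because the mixing weight is learned per module, the network can settle anywhere between pure Swish and pure GELU behavior during training rather than committing to either activation up front.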

Sources

SG-Blend: Learning an Interpolation Between Improved Swish and GELU for Robust Neural Representations

Proxy-FDA: Proxy-based Feature Distribution Alignment for Fine-tuning Vision Foundation Models without Forgetting

RCCDA: Adaptive Model Updates in the Presence of Concept Drift under a Constrained Resource Budget

GeoVision Labeler: Zero-Shot Geospatial Classification with Vision and Language Models

OASIS: Online Sample Selection for Continual Visual Instruction Tuning

Efficient Test-time Adaptive Object Detection via Sensitivity-Guided Pruning

Small Aid, Big Leap: Efficient Test-Time Adaptation for Vision-Language Models with AdaptNet

MINT: Memory-Infused Prompt Tuning at Test-time for CLIP

Unlabeled Data Improves Fine-Grained Image Zero-shot Classification with Multimodal LLMs

Budgeted Online Active Learning with Expert Advice and Episodic Priors

Robustness in Both Domains: CLIP Needs a Robust Text Encoder

Vocabulary-free few-shot learning for Vision-Language Models

Backbone Augmented Training for Adaptations

Robust Few-Shot Vision-Language Model Adaptation

Reliably detecting model failures in deployment without labels
