Advances in Vision-Language Models

Research on vision-language models is moving toward more efficient and adaptable architectures, with particular attention to test-time adaptation and open-vocabulary learning. Recent work indicates that Bayesian inference and dynamic caching can improve vision-language models on object recognition and detection tasks. Interest is also growing in adapting these models to new environments and tasks, such as aerial imagery and remote sensing. Notable papers include "Bayesian Test-time Adaptation for Object Recognition and Detection with Vision-language Models", which introduces a unified framework for test-time adaptation, and "Beyond the Seen: Bounded Distribution Estimation for Open-Vocabulary Learning", which proposes a method for estimating the distribution of unseen classes. Benchmarks for evaluating few-shot adaptation methods, such as the "Few-Shot Adaptation Benchmark for Remote Sensing Vision-Language Models", are expected to drive further progress in this area.
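To make the idea of Bayesian test-time adaptation more concrete, the sketch below shows one generic way it could look. It treats a model's zero-shot class scores as a likelihood and reweights them with a class prior that is updated as unlabeled test samples arrive. This is an illustrative assumption, not the method of any paper listed here: the function names (`softmax`, `posterior`, `update_prior`, `adapt_stream`), the `momentum` parameter, and the moving-average prior update are all hypothetical choices made for the example.

```python
import math


def softmax(scores):
    """Turn raw class scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def posterior(likelihood, prior):
    """Bayes' rule: posterior is proportional to likelihood times prior."""
    unnorm = [l * p for l, p in zip(likelihood, prior)]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def update_prior(prior, post, momentum=0.9):
    """Assumed update rule: exponential moving average of past posteriors."""
    return [momentum * p + (1.0 - momentum) * q for p, q in zip(prior, post)]


def adapt_stream(logit_stream, num_classes, momentum=0.9):
    """Classify a stream of test samples while adapting the class prior.

    logit_stream: iterable of per-class scores, one list per test sample
                  (for example, image-text similarities from a frozen model).
    Returns the predicted class index per sample and the final prior.
    """
    prior = [1.0 / num_classes] * num_classes  # start from a uniform prior
    predictions = []
    for logits in logit_stream:
        post = posterior(softmax(logits), prior)
        predictions.append(max(range(num_classes), key=post.__getitem__))
        prior = update_prior(prior, post, momentum)  # no labels are used
    return predictions, prior


# Hypothetical two-class stream: five confident class-0 samples, then one
# ambiguous sample whose raw scores slightly favor class 1.
stream = [[3.0, 0.0]] * 5 + [[1.0, 1.05]]
predictions, final_prior = adapt_stream(stream, num_classes=2)
```

In this toy stream, the prior drifts toward class 0 while the confident samples are processed, so the final ambiguous sample is assigned to class 0 even though its raw scores lean slightly toward class 1. No model weights are updated at any point; only the prior changes, which is one reason approaches of this kind can be cheap at test time.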

Sources

Bayesian Test-time Adaptation for Object Recognition and Detection with Vision-language Models

Efficient Test-Time Scaling for Small Vision-Language Models

Cross-View Open-Vocabulary Object Detection in Aerial Imagery

Beyond the Seen: Bounded Distribution Estimation for Open-Vocabulary Learning

Few-Shot Adaptation Benchmark for Remote Sensing Vision-Language Models
