The field of visual foundation models is moving toward self-supervised adaptation methods that learn efficiently from new domains without annotations. This direction is driven by the need to handle distribution shifts and label scarcity, settings where supervised fine-tuning is often infeasible. Recent work formulates self-supervised fine-tuning around multi-view, object-centric videos and parameter-efficient adaptation techniques; these methods show consistent gains on downstream classification tasks and offer a path beyond traditional closed-set assumptions. Noteworthy papers include VESSA, which introduces a self-distillation paradigm for self-supervised fine-tuning; Test-Time Adaptive Object Detection with Foundation Model, which proposes a multi-modal prompt-based mean-teacher framework for test-time adaptation; and Prototype-Driven Adaptation for Few-Shot Object Detection, which presents a lightweight metric head that provides a prototype-based second opinion for few-shot object detection.
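
To make the trend concrete, below is a minimal PyTorch sketch of parameter-efficient self-distillation fine-tuning: the backbone stays frozen, only a small residual adapter and projection head are trained, and an EMA teacher is distilled across two views of the same object. The names (`AdapterHead`, `adaptation_step`), dimensions, and hyperparameters are illustrative assumptions under a generic DINO-style recipe, not the VESSA implementation.

```python
# Illustrative sketch (not the VESSA method): self-distillation adaptation
# with a frozen backbone, a trainable adapter head, and an EMA teacher head.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdapterHead(nn.Module):
    """Lightweight bottleneck adapter plus projection head (hypothetical names)."""
    def __init__(self, dim: int, bottleneck: int = 64, out_dim: int = 256):
        super().__init__()
        self.adapter = nn.Sequential(
            nn.Linear(dim, bottleneck), nn.GELU(), nn.Linear(bottleneck, dim)
        )
        self.proj = nn.Linear(dim, out_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.proj(feats + self.adapter(feats))  # residual adapter


def distillation_loss(student_out, teacher_out, temp_s=0.1, temp_t=0.04):
    """Cross-entropy between sharpened teacher and student distributions."""
    t = F.softmax(teacher_out / temp_t, dim=-1).detach()
    s = F.log_softmax(student_out / temp_s, dim=-1)
    return -(t * s).sum(dim=-1).mean()


@torch.no_grad()
def ema_update(teacher: nn.Module, student: nn.Module, momentum: float = 0.996):
    """Teacher parameters track the student as an exponential moving average."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1.0 - momentum)


def adaptation_step(backbone, student_head, teacher_head, optimizer, view_a, view_b):
    """One self-distillation step on two views of the same object or scene."""
    with torch.no_grad():  # the foundation-model backbone stays frozen
        feat_a, feat_b = backbone(view_a), backbone(view_b)
    loss = (
        distillation_loss(student_head(feat_a), teacher_head(feat_b))
        + distillation_loss(student_head(feat_b), teacher_head(feat_a))
    ) / 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update(teacher_head, student_head)
    return loss.item()


if __name__ == "__main__":
    # Toy wiring: the teacher head starts as a copy of the student head and
    # receives no gradients; a tiny stand-in backbone replaces a real ViT.
    backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 384)).eval()
    student = AdapterHead(dim=384)
    teacher = copy.deepcopy(student)
    for p in teacher.parameters():
        p.requires_grad_(False)
    opt = torch.optim.AdamW(student.parameters(), lr=1e-4)
    view_a, view_b = torch.randn(8, 3, 32, 32), torch.randn(8, 3, 32, 32)
    print(adaptation_step(backbone, student, teacher, opt, view_a, view_b))
```

Because only the adapter and projection parameters are optimized, the memory and compute cost of adaptation stays small relative to full fine-tuning, which is the appeal of the parameter-efficient variants discussed above.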
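
The prototype-based "second opinion" idea can likewise be sketched in a few lines: class prototypes are mean embeddings of a handful of support examples, and query embeddings (for instance, detector proposal features) are scored by cosine similarity, to be fused with the detector's own classifier scores. This is a generic prototypical-classification sketch with hypothetical names, not the metric head from the cited paper.

```python
# Hypothetical prototype-scoring sketch; names and dimensions are illustrative.
import torch
import torch.nn.functional as F


def build_prototypes(support_feats: torch.Tensor, support_labels: torch.Tensor,
                     num_classes: int) -> torch.Tensor:
    """Average the L2-normalized support embeddings per class."""
    feats = F.normalize(support_feats, dim=-1)
    protos = torch.stack([feats[support_labels == c].mean(dim=0)
                          for c in range(num_classes)])
    return F.normalize(protos, dim=-1)


def prototype_scores(query_feats: torch.Tensor, prototypes: torch.Tensor,
                     temperature: float = 0.1) -> torch.Tensor:
    """Cosine-similarity logits over classes, softmaxed into a score per class."""
    sims = F.normalize(query_feats, dim=-1) @ prototypes.t()
    return F.softmax(sims / temperature, dim=-1)


if __name__ == "__main__":
    # 3 classes, 5 support shots each, 384-dim embeddings (toy numbers).
    support = torch.randn(15, 384)
    labels = torch.arange(3).repeat_interleave(5)
    protos = build_prototypes(support, labels, num_classes=3)
    queries = torch.randn(4, 384)  # e.g. embeddings of detector proposals
    print(prototype_scores(queries, protos).argmax(dim=-1))
```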