Advances in Vision-Language Models

Research on vision-language models is moving toward more efficient fine-tuning, with an emphasis on personalized and adaptable models. Recent work introduces new adapter variants, integrated structural prompt learning, and online EM algorithms to improve performance on downstream tasks. These methods target two challenges: preserving strong generalization, particularly to unseen classes, and making test-time adaptation more flexible. Notable papers in this area include:

pFedMMA, which proposes a personalized federated learning framework with multi-modal adapters for vision-language tasks, achieving state-of-the-art trade-offs between personalization and generalization.
Dynamic Rank Adaptation, which introduces a novel adapter variant that enhances new-class generalization by dynamically allocating adaptation ranks based on feature importance.
Integrated Structural Prompt Learning, which proposes an integrated structural prompt to model the structural relationships between learnable prompts and tokens within and across modalities, improving the interaction of information representations between the text and image branches.
Free on the Fly, which introduces a training-free and universally available method for test-time adaptation, making no assumptions about accessing or storing historical training and test data.
Visual Instance-aware Prompt Tuning, which generates instance-aware prompts based on each individual input and fuses them with dataset-level prompts, leveraging Principal Component Analysis to retain important prompting information.
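
To make the PCA step in the last item more concrete, here is a minimal sketch of how principal components could be used to keep the dominant directions of a set of instance-level prompts before combining them with dataset-level prompts. It is not the paper's implementation: the function names `pca_reduce` and `fuse`, the choice to treat each prompt token as one sample, the additive fusion, and the random arrays standing in for learned prompts are all assumptions made for illustration.

```python
import numpy as np


def pca_reduce(prompts: np.ndarray, k: int) -> np.ndarray:
    """Rebuild prompts from their top-k principal components.

    prompts: array of shape (num_tokens, dim); each row is one prompt token.
    """
    mean = prompts.mean(axis=0, keepdims=True)
    centered = prompts - mean
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top = vt[:k]  # shape (k, dim)
    # Project onto the top-k directions, then map back to the original space.
    return centered @ top.T @ top + mean


def fuse(instance_prompts: np.ndarray, dataset_prompts: np.ndarray, k: int) -> np.ndarray:
    """Hypothetical fusion: add PCA-filtered instance prompts to dataset prompts."""
    return dataset_prompts + pca_reduce(instance_prompts, k)


# Random stand-ins for learned prompts (8 tokens, 16 dimensions each).
rng = np.random.default_rng(0)
instance_prompts = rng.normal(size=(8, 16))
dataset_prompts = rng.normal(size=(8, 16))

fused = fuse(instance_prompts, dataset_prompts, k=4)
```

Here `k` controls how much instance-specific variation survives: a small `k` keeps only the strongest directions, while a `k` equal to the rank of the centered prompts reproduces the instance prompts unchanged.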
