Advances in Vision-Language Models

The field of vision-language models (VLMs) is advancing rapidly, with a focus on improving fine-tuning methods and adapting pre-trained models to new tasks and domains. Recent work centers on boosting few-shot performance, mitigating cross-image information leakage in multi-image tasks, and addressing class imbalance under long-tailed distributions. Researchers have proposed novel fine-tuning strategies, such as manifold-aligned fine-tuning and dynamic prompt routing, that improve downstream performance while preserving the structure of the pre-trained semantic manifold. Contrastive learning and multi-view collaborative optimization have also been explored to make feature learning more robust. Together, these advances are pushing the boundaries of vision-language models and broadening their range of applications.

Noteworthy papers include: Better Supervised Fine-tuning for VQA: Integer-Only Loss, which proposes a novel fine-tuning approach for video quality assessment tasks, and Fine-Grained VLM Fine-tuning via Latent Hierarchical Adapter Learning, which develops a latent hierarchical adapter for fine-tuning VLMs on downstream few-shot classification tasks.
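To make the contrastive-learning theme concrete, the sketch below shows a CLIP-style symmetric InfoNCE objective over a batch of paired image/text embeddings, the kind of loss these fine-tuning methods typically build on. This is a minimal NumPy illustration, not the implementation from any of the papers above; the function name and temperature value are our own choices.

```python
import numpy as np

def info_nce_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss over paired image/text embeddings.

    image_feats, text_feats: (batch, dim) arrays where row i of each
    is a matched image-text pair. Embeddings are L2-normalized before
    the similarity computation, as in CLIP-style training.
    """
    # Normalize embeddings to unit length so the dot product is cosine similarity.
    img = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    txt = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)

    # Pairwise similarities, scaled by temperature: (batch, batch).
    logits = img @ txt.T / temperature

    # The diagonal holds the positive pairs; everything else is a negative.
    labels = np.arange(len(logits))

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Pulling matched pairs together and pushing mismatched pairs apart is what makes the learned features robust; the loss drops as the diagonal similarities come to dominate each row and column.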

Sources

Better Supervised Fine-tuning for VQA: Integer-Only Loss

Fine-Grained VLM Fine-tuning via Latent Hierarchical Adapter Learning

Borrowing From the Future: Enhancing Early Risk Assessment through Contrastive Learning

Contrastive Regularization over LoRA for Multimodal Biomedical Image Incremental Learning

Data Mixing Optimization for Supervised Fine-Tuning of Large Language Models

DynamixSFT: Dynamic Mixture Optimization of Instruction Tuning Collections

Infusing fine-grained visual knowledge to Vision-Language Models

CLAIR: CLIP-Aided Weakly Supervised Zero-Shot Cross-Domain Image Retrieval

Cross-Domain Few-Shot Learning via Multi-View Collaborative Optimization with Vision-Language Models

Preserve and Sculpt: Manifold-Aligned Fine-tuning of Vision-Language Models for Few-Shot Learning

Mitigating Cross-Image Information Leakage in LVLMs for Multi-Image Tasks

LLM-empowered Dynamic Prompt Routing for Vision-Language Models Tuning under Long-Tailed Distributions
