Advances in Vision-Language Models

The field of vision-language models (VLMs) is advancing rapidly, with a focus on improving fine-tuning methods and adapting pre-trained models to new tasks and domains. Recent work centers on boosting few-shot performance, mitigating cross-image information leakage in multi-image tasks, and addressing class imbalance under long-tailed distributions. Researchers have proposed novel fine-tuning strategies, such as manifold-aligned fine-tuning and dynamic prompt routing, that improve downstream performance while preserving the structure of the pre-trained semantic manifold. Contrastive learning and multi-view collaborative optimization have also been explored to make feature learning more robust. Together, these advances are pushing the boundaries of vision-language models and broadening their range of applications.

Noteworthy papers include: Better Supervised Fine-tuning for VQA: Integer-Only Loss, which proposes a novel fine-tuning approach for video quality assessment tasks, and Fine-Grained VLM Fine-tuning via Latent Hierarchical Adapter Learning, which develops a latent hierarchical adapter for fine-tuning VLMs on downstream few-shot classification tasks.
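To make the contrastive-learning theme concrete, the sketch below shows a CLIP-style symmetric InfoNCE objective over a batch of paired image/text embeddings, the kind of loss these fine-tuning methods typically build on. This is a minimal NumPy illustration, not the implementation from any of the papers above; the function name and temperature value are our own choices.

```python
import numpy as np

def info_nce_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss over paired image/text embeddings.

    image_feats, text_feats: (batch, dim) arrays where row i of each
    is a matched image-text pair. Embeddings are L2-normalized before
    the similarity computation, as in CLIP-style training.
    """
    # Normalize embeddings to unit length so the dot product is cosine similarity.
    img = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    txt = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)

    # Pairwise similarities, scaled by temperature: (batch, batch).
    logits = img @ txt.T / temperature

    # The diagonal holds the positive pairs; everything else is a negative.
    labels = np.arange(len(logits))

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Pulling matched pairs together and pushing mismatched pairs apart is what makes the learned features robust; the loss drops as the diagonal similarities come to dominate each row and column.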

Sources

Better Supervised Fine-tuning for VQA: Integer-Only Loss

Fine-Grained VLM Fine-tuning via Latent Hierarchical Adapter Learning

Borrowing From the Future: Enhancing Early Risk Assessment through Contrastive Learning

Contrastive Regularization over LoRA for Multimodal Biomedical Image Incremental Learning

Data Mixing Optimization for Supervised Fine-Tuning of Large Language Models

DynamixSFT: Dynamic Mixture Optimization of Instruction Tuning Collections

Infusing fine-grained visual knowledge to Vision-Language Models

CLAIR: CLIP-Aided Weakly Supervised Zero-Shot Cross-Domain Image Retrieval

Cross-Domain Few-Shot Learning via Multi-View Collaborative Optimization with Vision-Language Models

Preserve and Sculpt: Manifold-Aligned Fine-tuning of Vision-Language Models for Few-Shot Learning

Mitigating Cross-Image Information Leakage in LVLMs for Multi-Image Tasks

LLM-empowered Dynamic Prompt Routing for Vision-Language Models Tuning under Long-Tailed Distributions
