Continual Learning and Vision-Language Models

Computer vision research is increasingly centered on continual learning and vision-language models (VLMs), with work spanning image classification, object detection, and visual question answering. A central challenge in continual learning is adapting to new tasks and data distributions without catastrophically forgetting prior knowledge; techniques such as synthetic replay, adversarial training, and knowledge distillation are being developed to address it. Noteworthy papers in this area include LoRA-Loop, which proposes a LoRA-enhanced synthetic-replay framework for continual vision-language learning, and Franca, a fully open-source vision foundation model that matches or surpasses state-of-the-art proprietary models. Meanwhile, CLIPTTA and HiCroPL explore new approaches to test-time adaptation and hierarchical cross-modal prompt learning for vision-language models, reporting notable gains in accuracy and robustness.
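To make the replay-plus-distillation idea concrete, below is a minimal PyTorch sketch of a single continual-learning training step that combines a cross-entropy loss on new-task data with a distillation loss on replayed examples. Everything here (the ReplayBuffer, train_step, the temperature and loss weighting) is an illustrative assumption, not the method of any paper listed under Sources.

```python
# Minimal sketch of continual learning with replay + knowledge distillation.
# Assumes PyTorch; buffer design, temperature, and loss weighting are
# illustrative choices, not taken from any cited paper.
import random
import torch
import torch.nn.functional as F


class ReplayBuffer:
    """Reservoir-style buffer holding (input, label) pairs from past tasks."""

    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.data = []
        self.seen = 0

    def add(self, x, y):
        self.seen += 1
        if len(self.data) < self.capacity:
            self.data.append((x, y))
        else:
            idx = random.randrange(self.seen)  # reservoir sampling
            if idx < self.capacity:
                self.data[idx] = (x, y)

    def sample(self, batch_size):
        batch = random.sample(self.data, min(batch_size, len(self.data)))
        xs, ys = zip(*batch)
        return torch.stack(xs), torch.stack(ys)


def train_step(model, teacher, buffer, x_new, y_new, opt, alpha=0.5, T=2.0):
    """One step: cross-entropy on new-task data + distillation on replay."""
    opt.zero_grad()
    loss = F.cross_entropy(model(x_new), y_new)  # learn the new task
    if buffer.data:
        x_old, _ = buffer.sample(x_new.size(0))
        with torch.no_grad():  # teacher is a frozen pre-task snapshot
            t_logits = teacher(x_old) / T
        s_logits = model(x_old) / T
        # KL between softened teacher and student outputs preserves
        # old-task behaviour while the model adapts to new data.
        loss = loss + alpha * T * T * F.kl_div(
            F.log_softmax(s_logits, dim=1),
            F.softmax(t_logits, dim=1),
            reduction="batchmean",
        )
    loss.backward()
    opt.step()
    return loss.item()
```

In a full continual-learning loop, the teacher would be a frozen copy of the model snapshotted at each task boundary, and the replayed inputs could come from a generative model rather than a stored buffer; synthetic replay, as in LoRA-Loop, follows that generative variant.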

Sources

LoRA-Loop: Closing the Synthetic Replay Cycle for Continual VLM Learning

Franca: Nested Matryoshka Clustering for Scalable Visual Representation Learning

CLIPTTA: Robust Contrastive Vision-Language Test-Time Adaptation

X-Nav: Learning End-to-End Cross-Embodiment Navigation for Mobile Robots

Exploring Scalable Unified Modeling for General Low-Level Vision

Hierarchical Cross-modal Prompt Learning for Vision-Language Models

One Last Attention for Your Vision-Language Model

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction

Advancing Visual Large Language Model for Multi-granular Versatile Perception

Quality Text, Robust Vision: The Role of Language in Enhancing Visual Robustness of Vision-Language Models

Cross-Modal Distillation For Widely Differing Modalities

PIG-Nav: Key Insights for Pretrained Image Goal Navigation Models

MaskedCLIP: Bridging the Masked and CLIP Space for Semi-Supervised Medical Vision-Language Pre-training

ViRN: Variational Inference and Distribution Trilateration for Long-Tailed Continual Representation Learning

LMM-Det: Make Large Multimodal Models Excel in Object Detection
