Advances in Multimodal Large Language Models

The field of multimodal large language models is moving towards more efficient and effective methods for model merging, continual learning, and task adaptation. Researchers are exploring techniques that address challenges such as expert uniformity, router rigidity, and catastrophic forgetting. Notable trends include layer-wise task vector fusion, dynamic token-aware routing, and branch-based LoRA frameworks, which aim to improve performance while reducing interference between tasks. Noteworthy papers include EvoMoE, which introduces a novel expert initialization strategy and a dynamic routing mechanism; BranchLoRA, which enhances multimodal continual instruction tuning with a flexible tuning-freezing mechanism; StatsMerging, a lightweight learning-based model merging method guided by weight distribution statistics; and Hierarchical-Task-Aware Multi-modal Mixture of Incremental LoRA Experts, which recognizes tasks by clustering visual-text embeddings.
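
To make the task-vector idea underlying several of these merging papers concrete, the sketch below shows a generic layer-wise fusion step in PyTorch: each fine-tuned checkpoint's task vector (its weights minus the base weights) is scaled by a per-layer coefficient and added back to the base model. This is a minimal illustration under stated assumptions, not the specific EvoMoE, FroM, or StatsMerging algorithm; the function name `merge_task_vectors` and the `layer_coeffs` argument are hypothetical.

```python
import torch

def merge_task_vectors(base_state, task_states, layer_coeffs=None):
    """Layer-wise task-vector merging (generic sketch, not a specific paper's method).

    base_state   -- state_dict of the shared pretrained model
    task_states  -- list of state_dicts fine-tuned on individual tasks
    layer_coeffs -- optional {param_name: float} per-layer scaling; defaults to 1/N
    """
    n_tasks = len(task_states)
    merged = {}
    for name, base_param in base_state.items():
        if not torch.is_floating_point(base_param):
            # Copy integer buffers (e.g., counters) unchanged.
            merged[name] = base_param.clone()
            continue
        coeff = (layer_coeffs or {}).get(name, 1.0 / n_tasks)
        # Task vector = fine-tuned weights minus base weights, accumulated over tasks.
        delta = sum(ts[name] - base_param for ts in task_states)
        merged[name] = base_param + coeff * delta
    return merged
```

In this framing, the approaches above differ mainly in how the per-layer coefficients are obtained: fixed and uniform in the simplest case, or chosen adaptively, for example guided by weight statistics as in StatsMerging.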

Sources

EvoMoE: Expert Evolution in Mixture of Experts for Multimodal Large Language Models

Towards Minimizing Feature Drift in Model Merging: Layer-wise Task Vector Fusion for Adaptive Knowledge Integration

On Fairness of Task Arithmetic: The Role of Task Vectors

Enhancing Multimodal Continual Instruction Tuning with BranchLoRA

FroM: Frobenius Norm-Based Data-Free Adaptive Model Merging

Continual Learning in Vision-Language Models via Aligned Model Merging

ViT-Split: Unleashing the Power of Vision Foundation Models via Efficient Splitting Heads

StatsMerging: Statistics-Guided Model Merging via Task-Specific Teacher Distillation

Hierarchical-Task-Aware Multi-modal Mixture of Incremental LoRA Experts for Embodied Continual Learning
