The field of multimodal large language models is evolving rapidly, with a focus on improving generalization and reasoning capabilities. Recent work has centered on data mixing strategies, such as online data mixing, which adapts the sampling proportions of different data domains during training, and multi-domain data mixtures, both of which aim to outperform static, hand-tuned mixtures. In parallel, researchers have proposed mixture-of-experts and task-aware mixture-of-experts architectures, which route inputs to specialized expert subnetworks to mitigate the conflicts that arise when heterogeneous task objectives share one set of parameters. Noteworthy papers include Mixed-R1, which presents a unified reward perspective on reasoning capability in multimodal large language models, and Mixpert, which introduces an efficient mixture-of-vision-experts architecture for task-specific fine-tuning. Complementing these modeling advances, benchmarks such as VS-Bench enable more comprehensive evaluation of vision-language models on strategic reasoning and decision-making. Together, these developments are pushing the field toward more robust and versatile multimodal models.
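
To make the data-mixing idea concrete, here is a minimal sketch of loss-driven online mixing, assuming a training loop that draws each batch from one domain and reports its loss. The class name, domain names, and the EMA-plus-softmax heuristic are illustrative assumptions, not the algorithm of any specific paper cited above.

```python
import math
import random
from collections import defaultdict

class OnlineDataMixer:
    """Sketch of loss-driven online data mixing.

    Domains whose recent training loss is high are sampled more often,
    on the intuition that they are currently under-learned. The softmax
    temperature controls how aggressively the mixture shifts.
    """

    def __init__(self, domains, temperature=1.0, ema=0.9):
        self.domains = list(domains)
        self.temperature = temperature
        self.ema = ema  # smoothing factor for per-domain loss estimates
        self.loss_estimate = defaultdict(lambda: 1.0)

    def update(self, domain, loss):
        # Exponential moving average of the observed loss per domain.
        prev = self.loss_estimate[domain]
        self.loss_estimate[domain] = self.ema * prev + (1 - self.ema) * loss

    def weights(self):
        # Softmax over smoothed losses -> domain sampling probabilities.
        scores = [self.loss_estimate[d] / self.temperature for d in self.domains]
        mx = max(scores)
        exps = [math.exp(s - mx) for s in scores]
        total = sum(exps)
        return [e / total for e in exps]

    def sample_domain(self):
        return random.choices(self.domains, weights=self.weights(), k=1)[0]

# Usage: pick a domain each step, train on a batch from it, report its loss.
mixer = OnlineDataMixer(["captioning", "vqa", "ocr"], temperature=0.5)
for step in range(100):
    domain = mixer.sample_domain()
    loss = random.uniform(0.5, 2.0)  # stand-in for the real batch loss
    mixer.update(domain, loss)
print(dict(zip(mixer.domains, mixer.weights())))
```

The key design choice is the feedback loop: the mixture is a function of training signals rather than a fixed schedule, which is what distinguishes online mixing from a precomputed multi-domain mixture.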
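
Similarly, the following is one illustrative way to implement a task-aware mixture-of-experts layer in PyTorch. It uses dense soft routing for readability (real systems typically use sparse top-k routing), and it is not the Mixpert design, whose internals the summary above does not specify; all names and dimensions are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TaskAwareMoE(nn.Module):
    """Sketch of a task-aware mixture-of-experts layer.

    The router sees the input features concatenated with a learned task
    embedding, so different tasks can prefer different experts and the
    layer can keep conflicting task objectives in separate parameters.
    """

    def __init__(self, dim, num_experts, num_tasks, hidden=256):
        super().__init__()
        self.task_embed = nn.Embedding(num_tasks, dim)
        self.router = nn.Linear(2 * dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, x, task_id):
        # x: (batch, dim); task_id: (batch,) integer task labels.
        t = self.task_embed(task_id)                     # (batch, dim)
        logits = self.router(torch.cat([x, t], dim=-1))  # (batch, num_experts)
        gates = F.softmax(logits, dim=-1)
        # Dense (soft) combination of all experts, weighted by the gate.
        out = torch.stack([e(x) for e in self.experts], dim=1)  # (batch, E, dim)
        return (gates.unsqueeze(-1) * out).sum(dim=1)

# Usage: route examples from different tasks through the same layer.
layer = TaskAwareMoE(dim=64, num_experts=4, num_tasks=3)
x = torch.randn(8, 64)
task_id = torch.randint(0, 3, (8,))
print(layer(x, task_id).shape)  # torch.Size([8, 64])
```

Conditioning the router on the task identity is what makes the routing "task-aware": without the task embedding, the gate can only separate inputs by their features, not by the objective they are trained under.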