Multimodal Large Language Models

The field of multimodal large language models is moving toward improving efficiency and robustness. Researchers are exploring methods to reduce computational cost and enhance performance, particularly in tasks that require joint modeling of visual and textual inputs. One notable direction is the development of token pruning techniques, which select a compact yet representative subset of tokens to accelerate inference. Another area of focus is the elimination of alignment pre-training, which has been a major bottleneck in traditional multimodal learning approaches. There is also growing interest in applying classical visual coding principles to multimodal large language models, with the goal of maximizing information fidelity while minimizing computational cost. Noteworthy papers in this area include:

EVTP-IVS, which introduces a visual token pruning method that achieves up to a 5x speed-up on video tasks and 3.5x on image tasks.

Inverse-LLaVA, which eliminates alignment pre-training entirely through text-to-vision mapping and achieves notable improvements on reasoning-intensive tasks.

Prune2Drive, which presents a plug-and-play visual token pruning framework for multi-view vision-language models in autonomous driving, achieving significant speedups and memory savings.
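The core idea behind visual token pruning can be illustrated with a minimal sketch: score each visual token (for example, by the attention it receives from text tokens), keep only the top-scoring fraction, and pass the reduced set to the language model. This is a generic illustration of the technique, not the specific algorithm of EVTP-IVS or Prune2Drive; the function name, the use of attention scores as saliency, and the `keep_ratio` parameter are all assumptions for the example.

```python
import numpy as np

def prune_visual_tokens(tokens, scores, keep_ratio=0.25):
    """Keep only the top-scoring fraction of visual tokens.

    tokens: (N, D) array of visual token embeddings.
    scores: (N,) saliency scores, e.g. attention received from text tokens
            (hypothetical scoring signal for this sketch).
    keep_ratio: fraction of tokens to retain.
    Returns the retained tokens and their original indices.
    """
    n_keep = max(1, int(len(tokens) * keep_ratio))
    keep_idx = np.argsort(scores)[-n_keep:]  # indices of the top-k tokens
    keep_idx = np.sort(keep_idx)             # preserve original spatial order
    return tokens[keep_idx], keep_idx

# Example: prune a 576-patch image down to 25% of its tokens.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((576, 64))
scores = rng.random(576)
pruned, idx = prune_visual_tokens(tokens, scores)
```

Because self-attention cost grows quadratically with sequence length, reducing the visual token count to a quarter can cut that portion of the compute substantially, which is the source of the speedups these methods report.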

Sources

EVTP-IVS: Effective Visual Token Pruning For Unifying Instruction Visual Segmentation In Multi-Modal Large Language Models

Inverse-LLaVA: Eliminating Alignment Pre-training Through Text-to-Vision Mapping

Prune2Drive: A Plug-and-Play Framework for Accelerating Vision-Language Models in Autonomous Driving

Revisiting MLLM Token Technology through the Lens of Classical Visual Coding

Towards Efficient Vision State Space Models via Token Merging

GPT-2 as a Compression Preprocessor: Improving Gzip for Structured Text Domains

Directed-Tokens: A Robust Multi-Modality Alignment Approach to Large Language-Vision Models

Comp-X: On Defining an Interactive Learned Image Compression Paradigm With Expert-driven LLM Agent
