Multimodal Large Language Models

The field of multimodal large language models is moving toward improving efficiency and robustness. Researchers are exploring methods to reduce computational cost and enhance performance, particularly in tasks that require joint modeling of visual and textual inputs. One notable direction is the development of token pruning techniques, which select a compact yet representative subset of tokens to accelerate inference. Another area of focus is the elimination of alignment pre-training, which has been a major bottleneck in traditional multimodal learning approaches. There is also growing interest in applying classical visual coding principles to multimodal large language models, with the goal of maximizing information fidelity while minimizing computational cost. Noteworthy papers in this area include:

EVTP-IVS, which introduces a visual token pruning method that achieves up to a 5x speed-up on video tasks and 3.5x on image tasks.

Inverse-LLaVA, which eliminates alignment pre-training entirely through text-to-vision mapping and achieves notable improvements on reasoning-intensive tasks.

Prune2Drive, which presents a plug-and-play visual token pruning framework for multi-view vision-language models in autonomous driving, achieving significant speedups and memory savings.
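The core idea behind visual token pruning can be illustrated with a minimal sketch: score each visual token (for example, by the attention it receives from text tokens), keep only the top-scoring fraction, and pass the reduced set to the language model. This is a generic illustration of the technique, not the specific algorithm of EVTP-IVS or Prune2Drive; the function name, the use of attention scores as saliency, and the `keep_ratio` parameter are all assumptions for the example.

```python
import numpy as np

def prune_visual_tokens(tokens, scores, keep_ratio=0.25):
    """Keep only the top-scoring fraction of visual tokens.

    tokens: (N, D) array of visual token embeddings.
    scores: (N,) saliency scores, e.g. attention received from text tokens
            (hypothetical scoring signal for this sketch).
    keep_ratio: fraction of tokens to retain.
    Returns the retained tokens and their original indices.
    """
    n_keep = max(1, int(len(tokens) * keep_ratio))
    keep_idx = np.argsort(scores)[-n_keep:]  # indices of the top-k tokens
    keep_idx = np.sort(keep_idx)             # preserve original spatial order
    return tokens[keep_idx], keep_idx

# Example: prune a 576-patch image down to 25% of its tokens.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((576, 64))
scores = rng.random(576)
pruned, idx = prune_visual_tokens(tokens, scores)
```

Because self-attention cost grows quadratically with sequence length, reducing the visual token count to a quarter can cut that portion of the compute substantially, which is the source of the speedups these methods report.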

Sources

EVTP-IVS: Effective Visual Token Pruning For Unifying Instruction Visual Segmentation In Multi-Modal Large Language Models

Inverse-LLaVA: Eliminating Alignment Pre-training Through Text-to-Vision Mapping

Prune2Drive: A Plug-and-Play Framework for Accelerating Vision-Language Models in Autonomous Driving

Revisiting MLLM Token Technology through the Lens of Classical Visual Coding

Towards Efficient Vision State Space Models via Token Merging

GPT-2 as a Compression Preprocessor: Improving Gzip for Structured Text Domains

Directed-Tokens: A Robust Multi-Modality Alignment Approach to Large Language-Vision Models

Comp-X: On Defining an Interactive Learned Image Compression Paradigm With Expert-driven LLM Agent
