Efficient Multimodal Processing

Multimodal research is moving toward more efficient processing of large volumes of data, particularly for vision-language models. Recent work focuses on reducing computational cost and improving inference speed while preserving accuracy. Notable advances include token pruning strategies such as adaptive visual token pruning and variation-aware vision token dropping, which substantially reduce the number of visual tokens the language model must process. Complementary methods for accelerating large multimodal models, such as pyramid token merging and KV cache compression, have also been introduced. Together, these techniques make it more practical to deploy large multimodal models in real-world applications. Noteworthy papers include Towards Adaptive Visual Token Pruning for Large Multimodal Models, which proposes a mutual information-based token pruning strategy, and LightVLM, which introduces a simple yet effective acceleration method based on pyramid token merging and KV cache compression. A minimal illustration of the general token-pruning idea is sketched below.
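To make the token-pruning idea concrete, the following sketch drops visual tokens that receive little attention from the text tokens before they are passed to the language model. This is a minimal PyTorch illustration under stated assumptions, not the algorithm of any paper listed below: the `prune_visual_tokens` name, the mean-cross-attention scoring heuristic, and the `keep_ratio` parameter are all illustrative.

```python
# Illustrative sketch of attention-based visual token pruning.
# NOT the method from any cited paper; the scoring heuristic and
# keep_ratio are assumptions chosen for clarity.
import torch


def prune_visual_tokens(
    visual_tokens: torch.Tensor,   # (batch, num_visual, dim)
    cross_attn: torch.Tensor,      # (batch, num_text, num_visual) attention weights
    keep_ratio: float = 0.25,      # fraction of visual tokens to retain
) -> torch.Tensor:
    """Keep the visual tokens that receive the most attention from text tokens."""
    # Importance score per visual token: average attention it receives.
    scores = cross_attn.mean(dim=1)                        # (batch, num_visual)
    num_keep = max(1, int(visual_tokens.shape[1] * keep_ratio))
    topk = scores.topk(num_keep, dim=1).indices            # (batch, num_keep)
    # Restore the original token order so positional structure is preserved.
    topk, _ = topk.sort(dim=1)
    index = topk.unsqueeze(-1).expand(-1, -1, visual_tokens.shape[-1])
    return visual_tokens.gather(dim=1, index=index)        # (batch, num_keep, dim)


if __name__ == "__main__":
    tokens = torch.randn(2, 576, 1024)                     # e.g. a 24x24 patch grid
    attn = torch.rand(2, 32, 576).softmax(dim=-1)          # dummy text-to-visual attention
    pruned = prune_visual_tokens(tokens, attn, keep_ratio=0.25)
    print(pruned.shape)                                    # torch.Size([2, 144, 1024])
```

Keeping a quarter of the tokens shrinks the quadratic attention cost in the language model accordingly; the papers below differ mainly in how the importance scores are computed (e.g., mutual information, token variation) and whether pruned tokens are merged rather than discarded.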

Sources

Towards Adaptive Visual Token Pruning for Large Multimodal Models

HERO-VQL: Hierarchical, Egocentric and Robust Visual Query Localization

LightVLM: Accelerating Large Multimodal Models with Pyramid Token Merging and KV Cache Compression

Seeing More, Saying More: Lightweight Language Experts are Dynamic Video Token Compressors

Do Video Language Models Really Know Where to Look? Diagnosing Attention Failures in Video Language Models

Variation-aware Vision Token Dropping for Faster Large Vision-Language Models

OpenVision 2: A Family of Generative Pretrained Visual Encoders for Multimodal Learning

FastVGGT: Training-Free Acceleration of Visual Geometry Transformer

PixFoundation 2.0: Do Video Multi-Modal LLMs Use Motion in Visual Grounding?

TinyDrop: Tiny Model Guided Token Dropping for Vision Transformers

Strefer: Empowering Video LLMs with Space-Time Referring and Reasoning via Synthetic Instruction Data

OOTSM: A Decoupled Linguistic Framework for Effective Scene Graph Anticipation

Leveraging Vision-Language Large Models for Interpretable Video Action Recognition with Semantic Tokenization

Unleashing Hierarchical Reasoning: An LLM-Driven Framework for Training-Free Referring Video Object Segmentation

Harnessing Object Grounding for Time-Sensitive Video Understanding

Index-Preserving Lightweight Token Pruning for Efficient Document Understanding in Vision-Language Models

Phantom-Insight: Adaptive Multi-cue Fusion for Video Camouflaged Object Detection with Multimodal LLM
