Efficient Processing of Long-Context Data in Multimodal Models

The field of multimodal learning is moving toward more efficient processing of long-context data, with a focus on reducing computational cost while preserving performance. Researchers are exploring techniques such as token pruning, attention approximation, and modality-agnostic architectures to mitigate the quadratic complexity of self-attention: because attention cost grows with the square of the sequence length, even modest reductions in token count yield large savings (a minimal sketch of this idea appears after the list below). These innovations could enable wider adoption of multimodal models in real-world applications. Noteworthy papers in this area include:

- Adapt, But Don't Forget, which proposes a fine-tuning and contrastive-routing framework for lane detection under distribution shift.
- MAELRE, which introduces a modality-agnostic efficient long-range encoder for long-context processing.
- TR-PTS, which selects task-relevant parameters and tokens for efficient tuning of large pre-trained models.
- FastDriveVLA, which presents a plug-and-play reconstruction-based token-pruning framework for efficient end-to-end driving.
- Short-LVLM, which compresses and accelerates large vision-language models by pruning redundant layers.
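To make the token-pruning idea concrete, here is a minimal PyTorch sketch that keeps only the visual tokens receiving the highest attention from a [CLS] token before they reach the language model. The scoring rule, the `keep_ratio` of 0.25, and the tensor shapes are illustrative assumptions, not the specific criteria used by FastDriveVLA, TransPrune, or the other papers listed here.

```python
# Illustrative sketch of attention-score-based visual token pruning.
# The keep_ratio and [CLS]-attention scoring rule are assumptions for
# demonstration; the surveyed papers each use their own criteria.
import torch

def prune_tokens(tokens: torch.Tensor, cls_attn: torch.Tensor,
                 keep_ratio: float = 0.25) -> torch.Tensor:
    """Keep the top-k visual tokens ranked by attention from the [CLS] token.

    tokens:   (batch, n_tokens, dim) visual token embeddings
    cls_attn: (batch, n_tokens) attention weights from [CLS] to each token
    """
    batch, n_tokens, dim = tokens.shape
    k = max(1, int(n_tokens * keep_ratio))
    # Indices of the k highest-scoring tokens per example.
    topk = cls_attn.topk(k, dim=1).indices            # (batch, k)
    idx = topk.unsqueeze(-1).expand(-1, -1, dim)      # (batch, k, dim)
    return tokens.gather(1, idx)                      # (batch, k, dim)

# Toy usage: prune 576 vision tokens down to 144 before the LLM sees them.
# Since self-attention cost is quadratic in token count, a 4x reduction
# shrinks the visual portion of that cost by roughly 16x.
tokens = torch.randn(2, 576, 1024)
cls_attn = torch.rand(2, 576)
pruned = prune_tokens(tokens, cls_attn)
print(pruned.shape)  # torch.Size([2, 144, 1024])
```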

Sources

Adapt, But Don't Forget: Fine-Tuning and Contrastive Routing for Lane Detection under Distribution Shift

Fishers for Free? Approximating the Fisher Information Matrix by Recycling the Squared Gradient Accumulator

Adapting to Fragmented and Evolving Data: A Fisher Information Perspective

Modality Agnostic Efficient Long Range Encoder

When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios

TransPrune: Token Transition Pruning for Efficient Large Vision-Language Model

METEOR: Multi-Encoder Collaborative Token Pruning for Efficient Vision Language Models

ReGATE: Learning Faster and Better with Fewer Tokens in MLLMs

TR-PTS: Task-Relevant Parameter and Token Selection for Efficient Tuning

FastDriveVLA: Efficient End-to-End Driving via Plug-and-Play Reconstruction-based Token Pruning

Short-LVLM: Compressing and Accelerating Large Vision-Language Models by Pruning Redundant Layers
