Efficient Inference in Multimodal Models

Research on multimodal models is increasingly focused on inference efficiency: reducing computational overhead and accelerating processing. Recent work introduces techniques such as token pruning, token merging, and dynamic importance estimation, applied to diffusion-based multimodal large language models (MLLMs), vision-language models, and visual geometric transformers, with significant speedups reported at little or no cost in accuracy.

Several papers stand out. D$^{3}$ToM proposes decider-guided dynamic token merging to accelerate inference in diffusion MLLMs. RedVTP introduces a response-driven visual token pruning strategy for diffusion vision-language models, substantially improving token-generation throughput and inference latency. Co-Me presents confidence-guided token merging for visual geometric transformers, achieving substantial acceleration without degrading performance. VLA-Pruner offers temporal-aware, dual-level visual token pruning for efficient vision-language-action (VLA) inference, aligning with the dual-system nature of VLA models and exploiting temporal continuity in robot manipulation.
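None of these papers' exact implementations are reproduced here, but the two generic operations they build on, importance-based pruning and similarity-based merging of visual tokens, can be sketched in a few lines. The importance scores and the adjacent-pair merging rule below are simplifying assumptions for illustration, not the methods proposed in any specific paper:

```python
import numpy as np

def prune_tokens(tokens, scores, keep_ratio=0.5):
    """Keep the top-k visual tokens ranked by an importance score.

    tokens: (N, D) token embeddings; scores: (N,) importance estimates
    (e.g. derived from attention, as several of the papers above do).
    """
    k = max(1, int(len(tokens) * keep_ratio))
    keep = np.argsort(scores)[-k:]          # indices of the k highest scores
    return tokens[np.sort(keep)]            # preserve original token order

def merge_tokens(tokens, num_merges):
    """Repeatedly average the most similar adjacent token pair.

    A deliberately simplified stand-in for the bipartite/confidence-guided
    matching used in real token-merging methods.
    """
    tokens = tokens.copy()
    for _ in range(num_merges):
        normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
        sims = (normed[:-1] * normed[1:]).sum(axis=1)   # cosine sim of neighbors
        i = int(np.argmax(sims))                        # most redundant pair
        merged = (tokens[i] + tokens[i + 1]) / 2
        tokens = np.concatenate([tokens[:i], merged[None], tokens[i + 2:]])
    return tokens

rng = np.random.default_rng(0)
toks = rng.normal(size=(16, 8))             # 16 visual tokens, dim 8
scores = rng.random(16)                     # placeholder importance scores
pruned = prune_tokens(toks, scores, keep_ratio=0.5)   # 8 tokens remain
merged = merge_tokens(toks, num_merges=4)             # 12 tokens remain
```

Both operations shrink the visual token sequence before (or during) the expensive transformer passes, which is where the reported throughput and latency gains come from.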

Sources

Attentive Feature Aggregation or: How Policies Learn to Stop Worrying about Robustness and Attend to Task-Relevant Visual Cues

D$^{3}$ToM: Decider-Guided Dynamic Token Merging for Accelerating Diffusion MLLMs

RedVTP: Training-Free Acceleration of Diffusion Vision-Language Models Inference via Masked Token-Guided Visual Token Pruning

Co-Me: Confidence-Guided Token Merging for Visual Geometric Transformers

A Comprehensive Study on Visual Token Redundancy for Discrete Diffusion-based Multimodal Large Language Models

Breaking the Bottleneck with DiffuApriel: High-Throughput Diffusion LMs with Mamba Backbone

VLA-Pruner: Temporal-Aware Dual-Level Visual Token Pruning for Efficient Vision-Language-Action Inference

gfnx: Fast and Scalable Library for Generative Flow Networks in JAX
