The field of multimodal models is moving towards improving inference efficiency, with a focus on reducing computational overhead and shortening processing times. Recent work has introduced techniques such as token pruning, token merging, and dynamic importance estimation to make these models faster without retraining them from scratch. Notably, these advances have been applied to diffusion-based multimodal large language models, vision-language models, and visual geometric transformers, demonstrating significant speedups without compromising accuracy.
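To make the general idea concrete, the following minimal PyTorch sketch shows importance-based visual token pruning: rank tokens by a per-token importance estimate (for example, attention received from text or query tokens) and keep only the top fraction. This is an illustrative sketch of the common pattern, not the algorithm of any specific paper below; the function name, the `keep_ratio` parameter, and the random scores in the example are all assumptions made for illustration.

```python
import torch

def prune_tokens_by_importance(tokens: torch.Tensor,
                               importance: torch.Tensor,
                               keep_ratio: float = 0.5) -> torch.Tensor:
    """Keep the top `keep_ratio` fraction of tokens ranked by an importance score.

    tokens:     (batch, num_tokens, dim) visual token embeddings
    importance: (batch, num_tokens) per-token importance estimates
                (e.g., attention mass received from text/query tokens)
    """
    b, n, d = tokens.shape
    k = max(1, int(n * keep_ratio))
    # Indices of the k most important tokens per sample.
    topk = importance.topk(k, dim=1).indices          # (b, k)
    topk, _ = topk.sort(dim=1)                        # preserve original token order
    return tokens.gather(1, topk.unsqueeze(-1).expand(b, k, d))

# Example: prune 576 visual tokens down to 288, using random scores as a
# stand-in for a real importance estimate.
x = torch.randn(2, 576, 1024)
scores = torch.rand(2, 576)
print(prune_tokens_by_importance(x, scores, keep_ratio=0.5).shape)  # torch.Size([2, 288, 1024])
```

Pruning of this kind reduces the sequence length seen by all subsequent layers, which is where most of the reported throughput and latency gains come from.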
Several noteworthy papers illustrate these directions. D$^{3}$ToM proposes a decider-guided dynamic token merging method to accelerate inference in diffusion MLLMs. RedVTP introduces a response-driven visual token pruning strategy for diffusion vision-language models, substantially improving token generation throughput and reducing inference latency. Co-Me presents a confidence-guided token merging mechanism for visual geometric transformers, enabling substantial acceleration without degrading performance. VLA-Pruner offers a temporal-aware, dual-level visual token pruning approach for efficient vision-language-action inference, aligning with the dual-system nature of VLA models and exploiting temporal continuity in robot manipulation.
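For the merging side of the spectrum, the sketch below shows one simple confidence-guided variant: low-confidence tokens are folded into their most similar high-confidence token by feature averaging, so information is aggregated rather than discarded. This is only a rough illustration of confidence-guided merging under assumed interfaces; it is not the mechanism of Co-Me or any other paper above, and the function name, confidence input, and cosine-similarity assignment are choices made for this example.

```python
import torch
import torch.nn.functional as F

def merge_low_confidence_tokens(tokens: torch.Tensor,
                                confidence: torch.Tensor,
                                keep_ratio: float = 0.5) -> torch.Tensor:
    """Merge the least-confident tokens into their most similar confident token.

    tokens:     (num_tokens, dim) token features
    confidence: (num_tokens,) per-token confidence scores
    Returns a reduced token set of roughly num_tokens * keep_ratio tokens.
    """
    n, d = tokens.shape
    k = max(1, int(n * keep_ratio))
    order = confidence.argsort(descending=True)
    keep_idx, merge_idx = order[:k], order[k:]
    kept, merged = tokens[keep_idx], tokens[merge_idx]

    # Assign each low-confidence token to its most similar kept token (cosine similarity).
    sim = F.normalize(merged, dim=-1) @ F.normalize(kept, dim=-1).T   # (n-k, k)
    assign = sim.argmax(dim=-1)                                       # (n-k,)

    # Average each kept token with the tokens merged into it.
    out = kept.clone()
    counts = torch.ones(k)
    out.index_add_(0, assign, merged)
    counts.index_add_(0, assign, torch.ones(merge_idx.numel()))
    return out / counts.unsqueeze(-1)

# Example: reduce 16 tokens to 8.
toks = torch.randn(16, 64)
conf = torch.rand(16)
print(merge_low_confidence_tokens(toks, conf, keep_ratio=0.5).shape)  # torch.Size([8, 64])
```

Compared with pruning, merging of this kind trades a small amount of extra computation for the chance to retain information from the tokens it removes, which is one reason the merging-based methods above report little or no accuracy loss at aggressive reduction ratios.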