Efficient Inference and Interpretability in Vision-Language Models

The field of vision-language models is moving toward more efficient inference and better interpretability. Recent work centers on pruning methods that reduce computational cost while preserving accuracy: selectively retaining high-utility context, discarding redundant visual tokens, and adapting the pruning budget to the complexity of each sample and task. Notable papers in this area include KV-Efficient VLA, which introduces a lightweight memory compression framework, and AutoPrune, a training-free framework that tailors its pruning policy to varying sample and task complexities. In addition, AFFORD2ACT and GUI-KV propose new approaches to keypoint selection and cache compression, respectively, demonstrating the potential for significant gains in both efficiency and performance.
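To make the general recipe concrete, the sketch below illustrates one common form of training-free visual token pruning: score each visual token by the attention it receives from the query tokens at a single intermediate layer, keep the top-k highest-scoring tokens, and forward only those to later layers. This is a minimal illustration under assumed tensor shapes; the function name, the fixed keep ratio, and the use of one middle layer's attention map are assumptions for exposition, not the specific procedure of HIVTP, AutoPrune, or any other paper cited in this section.

```python
import torch


def prune_visual_tokens(visual_tokens: torch.Tensor,
                        attn_weights: torch.Tensor,
                        keep_ratio: float = 0.3) -> torch.Tensor:
    """Keep only the visual tokens with the highest importance scores.

    visual_tokens: (batch, num_visual_tokens, hidden_dim) hidden states.
    attn_weights:  (batch, num_heads, num_queries, num_visual_tokens)
                   attention from query tokens (e.g. text/CLS positions)
                   to visual tokens, taken from one intermediate layer.
    keep_ratio:    fraction of visual tokens to retain.
    """
    # Importance score: attention mass each visual token receives,
    # averaged over heads and query positions.
    scores = attn_weights.mean(dim=(1, 2))        # (batch, num_visual_tokens)

    num_keep = max(1, int(keep_ratio * visual_tokens.shape[1]))
    keep_idx = scores.topk(num_keep, dim=-1).indices   # (batch, num_keep)
    keep_idx, _ = keep_idx.sort(dim=-1)                # preserve spatial order

    # Gather the retained tokens for each batch element.
    batch_idx = torch.arange(visual_tokens.shape[0]).unsqueeze(-1)
    return visual_tokens[batch_idx, keep_idx]     # (batch, num_keep, hidden_dim)


if __name__ == "__main__":
    # Toy shapes: 2 samples, 576 visual tokens, 1024-dim states,
    # 16 attention heads, 32 query (text) tokens.
    tokens = torch.randn(2, 576, 1024)
    attn = torch.rand(2, 16, 32, 576).softmax(dim=-1)
    pruned = prune_visual_tokens(tokens, attn, keep_ratio=0.25)
    print(pruned.shape)  # torch.Size([2, 144, 1024])
```

The design choice shared by such methods is that the score is computed from quantities the model already produces during a forward pass, so pruning requires no additional training and can be tuned per sample or per task simply by varying the keep ratio.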
Sources
Do Sparse Subnetworks Exhibit Cognitively Aligned Attention? Effects of Pruning on Saliency Map Fidelity, Sparsity, and Concept Coherence
HIVTP: A Training-Free Method to Improve VLMs Efficiency via Hierarchical Visual Token Pruning Using Middle-Layer-Based Importance Score