Efficient Inference and Interpretability in Vision-Language Models

The field of vision-language models is moving toward more efficient inference and better interpretability. Recent work centers on pruning methods that cut computational cost while preserving accuracy: selectively retaining high-utility context, discarding redundant visual tokens, and adapting pruning to the complexity of each sample and task. Notable papers include KV-Efficient VLA, which introduces a lightweight memory compression framework for the KV cache, and AutoPrune, a training-free framework that tailors its pruning policy to sample and task complexity. AFFORD2ACT and GUI-KV contribute complementary approaches, affordance-guided keypoint selection and spatio-temporally aware KV cache compression respectively, pointing to substantial efficiency gains without sacrificing performance.
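
To make the idea of training-free visual token pruning concrete, here is a minimal sketch that keeps the most-attended visual tokens and drops the rest. It assumes attention-based importance scoring from one intermediate layer; the function name, tensor shapes, and keep ratio are illustrative assumptions, not the exact procedure of any paper listed below.

```python
# Minimal sketch of training-free visual token pruning, illustrative only.
# Assumption: token importance = attention each visual token receives,
# averaged over heads and query positions at some chosen layer.
import torch


def prune_visual_tokens(visual_tokens: torch.Tensor,
                        attn_weights: torch.Tensor,
                        keep_ratio: float = 0.5) -> torch.Tensor:
    """Retain the top-k most-attended visual tokens.

    visual_tokens: (batch, num_tokens, dim) visual token embeddings.
    attn_weights:  (batch, num_heads, num_queries, num_tokens) attention
                   weights taken from an intermediate layer.
    keep_ratio:    fraction of visual tokens to keep.
    """
    # Importance score per visual token: mean attention over heads/queries.
    importance = attn_weights.mean(dim=(1, 2))            # (batch, num_tokens)
    num_keep = max(1, int(keep_ratio * visual_tokens.shape[1]))
    keep_idx = importance.topk(num_keep, dim=-1).indices  # (batch, num_keep)
    keep_idx, _ = keep_idx.sort(dim=-1)                   # preserve token order
    batch_idx = torch.arange(visual_tokens.shape[0]).unsqueeze(-1)
    return visual_tokens[batch_idx, keep_idx]             # (batch, num_keep, dim)


if __name__ == "__main__":
    tokens = torch.randn(2, 196, 768)                 # e.g. 14x14 ViT patches
    attn = torch.rand(2, 12, 32, 196).softmax(dim=-1)
    pruned = prune_visual_tokens(tokens, attn, keep_ratio=0.25)
    print(pruned.shape)                               # torch.Size([2, 49, 768])
```

The pruned token sequence is then fed to the remaining layers in place of the full visual sequence, which is where the compute savings come from; adaptive methods such as those surveyed here vary the keep ratio or scoring rule with sample and task complexity rather than fixing them as above.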

Sources

KV-Efficient VLA: A Method of Speed up Vision Language Model with RNN-Gated Chunked KV Cache

Do Sparse Subnetworks Exhibit Cognitively Aligned Attention? Effects of Pruning on Saliency Map Fidelity, Sparsity, and Concept Coherence

Action-aware Dynamic Pruning for Efficient Vision-Language-Action Manipulation

Focusing on What Matters: Object-Agent-centric Tokenization for Vision Language Action models

HIVTP: A Training-Free Method to Improve VLMs Efficiency via Hierarchical Visual Token Pruning Using Middle-Layer-Based Importance Score

AutoPrune: Each Complexity Deserves a Pruning Policy

FS-KAN: Permutation Equivariant Kolmogorov-Arnold Networks via Function Sharing

Training-Free Token Pruning via Zeroth-Order Gradient Estimation in Vision-Language Models

GUI-KV: Efficient GUI Agents via KV Cache with Spatio-Temporal Awareness

AFFORD2ACT: Affordance-Guided Automatic Keypoint Selection for Generalizable and Lightweight Robotic Manipulation

Shift-Invariant Attribute Scoring for Kolmogorov-Arnold Networks via Shapley Value
