Efficient Inference in Vision-Language Models

Research on vision-language models is increasingly focused on improving inference efficiency without sacrificing accuracy. Because visual tokens typically far outnumber text tokens, much of this work targets the visual stream: pruning redundant visual tokens, selecting and aggregating the most informative ones, and compressing or skipping parts of the model itself, all with the aim of making these models deployable in latency-sensitive applications. Notable papers include CoViPAL, which prunes visual tokens layer by layer using contextual information; PoRe, which reweights pruning scores by token position; VISA, which performs group-wise visual token selection and aggregation via graph summarization; MMTok, which frames token selection as multimodal coverage maximization; and GM-Skip, which skips transformer blocks under the guidance of task metrics. Complementing these efficiency methods, Hidden Tail shows that adversarial images can stealthily inflate resource consumption, a reminder that efficiency gains also need to be robust.
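To make the idea of visual token pruning concrete, below is a minimal sketch of a generic score-based pruner: rank visual tokens by a text-conditioned relevance score and keep only the top fraction before they reach the language model. This is an illustrative assumption, not the actual CoViPAL, PoRe, or VISA algorithms; the cosine-similarity scoring, the `keep_ratio` parameter, and the function name are all hypothetical.

```python
# Hypothetical score-based visual token pruning (illustrative only; not the
# exact method of any paper listed below): keep the top-k visual tokens
# ranked by similarity to a pooled text query embedding.
import torch


def prune_visual_tokens(
    visual_tokens: torch.Tensor,   # (batch, num_tokens, dim) patch embeddings
    text_query: torch.Tensor,      # (batch, dim) pooled text embedding
    keep_ratio: float = 0.25,      # fraction of visual tokens to retain
) -> torch.Tensor:
    """Return the subset of visual tokens most relevant to the text query."""
    # Relevance score: cosine similarity between each visual token and the query.
    v = torch.nn.functional.normalize(visual_tokens, dim=-1)
    q = torch.nn.functional.normalize(text_query, dim=-1).unsqueeze(1)  # (B, 1, D)
    scores = (v * q).sum(dim=-1)                                        # (B, N)

    num_keep = max(1, int(visual_tokens.shape[1] * keep_ratio))
    top_idx = scores.topk(num_keep, dim=-1).indices                     # (B, k)

    # Gather the retained tokens; downstream attention cost shrinks roughly
    # in proportion to keep_ratio.
    batch_idx = torch.arange(visual_tokens.shape[0]).unsqueeze(-1)
    return visual_tokens[batch_idx, top_idx]


if __name__ == "__main__":
    vis = torch.randn(2, 576, 1024)   # e.g. a 24x24 patch grid from a ViT encoder
    txt = torch.randn(2, 1024)
    pruned = prune_visual_tokens(vis, txt, keep_ratio=0.25)
    print(pruned.shape)               # torch.Size([2, 144, 1024])
```

The papers below refine this basic recipe in different ways, for example by scoring tokens with layer-wise context, reweighting by position, or aggregating groups of tokens instead of discarding them outright.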

Sources

CoViPAL: Layer-wise Contextualized Visual Token Pruning for Large Vision-Language Models

PoRe: Position-Reweighted Visual Token Pruning for Vision Language Models

VISA: Group-wise Visual Token Selection and Aggregation via Graph Summarization for Efficient MLLMs Inference

MMTok: Multimodal Coverage Maximization for Efficient Inference of VLMs

GM-Skip: Metric-Guided Transformer Block Skipping for Efficient Vision-Language Models

Hidden Tail: Adversarial Image Causing Stealthy Resource Consumption in Vision-Language Models
