Efficient Inference in Vision-Language Models

Research on vision-language models is increasingly focused on improving inference efficiency without sacrificing accuracy. Because visual tokens typically far outnumber text tokens, much of this work targets the visual stream: pruning redundant visual tokens, selecting and aggregating the most informative ones, and compressing or skipping parts of the model itself, all with the aim of making these models deployable in latency-sensitive applications. Notable papers include CoViPAL, which prunes visual tokens layer by layer using contextual information; PoRe, which reweights pruning scores by token position; VISA, which performs group-wise visual token selection and aggregation via graph summarization; MMTok, which frames token selection as multimodal coverage maximization; and GM-Skip, which skips transformer blocks under the guidance of task metrics. Complementing these efficiency methods, Hidden Tail shows that adversarial images can stealthily inflate resource consumption, a reminder that efficiency gains also need to be robust.
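To make the idea of visual token pruning concrete, below is a minimal sketch of a generic score-based pruner: rank visual tokens by a text-conditioned relevance score and keep only the top fraction before they reach the language model. This is an illustrative assumption, not the actual CoViPAL, PoRe, or VISA algorithms; the cosine-similarity scoring, the `keep_ratio` parameter, and the function name are all hypothetical.

```python
# Hypothetical score-based visual token pruning (illustrative only; not the
# exact method of any paper listed below): keep the top-k visual tokens
# ranked by similarity to a pooled text query embedding.
import torch


def prune_visual_tokens(
    visual_tokens: torch.Tensor,   # (batch, num_tokens, dim) patch embeddings
    text_query: torch.Tensor,      # (batch, dim) pooled text embedding
    keep_ratio: float = 0.25,      # fraction of visual tokens to retain
) -> torch.Tensor:
    """Return the subset of visual tokens most relevant to the text query."""
    # Relevance score: cosine similarity between each visual token and the query.
    v = torch.nn.functional.normalize(visual_tokens, dim=-1)
    q = torch.nn.functional.normalize(text_query, dim=-1).unsqueeze(1)  # (B, 1, D)
    scores = (v * q).sum(dim=-1)                                        # (B, N)

    num_keep = max(1, int(visual_tokens.shape[1] * keep_ratio))
    top_idx = scores.topk(num_keep, dim=-1).indices                     # (B, k)

    # Gather the retained tokens; downstream attention cost shrinks roughly
    # in proportion to keep_ratio.
    batch_idx = torch.arange(visual_tokens.shape[0]).unsqueeze(-1)
    return visual_tokens[batch_idx, top_idx]


if __name__ == "__main__":
    vis = torch.randn(2, 576, 1024)   # e.g. a 24x24 patch grid from a ViT encoder
    txt = torch.randn(2, 1024)
    pruned = prune_visual_tokens(vis, txt, keep_ratio=0.25)
    print(pruned.shape)               # torch.Size([2, 144, 1024])
```

The papers below refine this basic recipe in different ways, for example by scoring tokens with layer-wise context, reweighting by position, or aggregating groups of tokens instead of discarding them outright.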

Sources

CoViPAL: Layer-wise Contextualized Visual Token Pruning for Large Vision-Language Models

PoRe: Position-Reweighted Visual Token Pruning for Vision Language Models

VISA: Group-wise Visual Token Selection and Aggregation via Graph Summarization for Efficient MLLMs Inference

MMTok: Multimodal Coverage Maximization for Efficient Inference of VLMs

GM-Skip: Metric-Guided Transformer Block Skipping for Efficient Vision-Language Models

Hidden Tail: Adversarial Image Causing Stealthy Resource Consumption in Vision-Language Models
