Efficient Vision-Language Models

The field of vision-language models is moving toward more efficient, compressed architectures, with a focus on reducing computational overhead and inference latency. This is being achieved through methods such as frequency-domain compression, adaptive token pruning, and collaborative small-large model frameworks, which together are enabling more practical and widely deployable vision-language models. Notable papers include Fourier-VLM, which compresses vision tokens in the frequency domain and reduces inference FLOPs by up to 83.8% while achieving competitive performance and strong generalizability; AdaptInfer, which prunes vision tokens adaptively under dynamic text guidance, reducing CUDA latency by 61.3% while maintaining an average accuracy of 92.9% on vanilla LLaVA-1.5-7B; Small-Large Collaboration, which enables training-efficient concept personalization for large VLMs via a meta-personalized small VLM; and LLMC+, which provides a comprehensive plug-and-play benchmark for vision-language model compression.
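To make the frequency-domain idea concrete, the sketch below shows one way a vision token sequence could be shortened by keeping only its low-frequency components before it is passed to the language model. This is a minimal illustration of the general technique, not Fourier-VLM's actual method; the function name, tensor shapes, and keep ratio are assumptions chosen for the example.

```python
import torch


def compress_vision_tokens(tokens: torch.Tensor, keep_ratio: float = 0.25) -> torch.Tensor:
    """Toy frequency-domain compression of a vision token sequence.

    tokens:     (batch, num_tokens, hidden_dim) features from a vision encoder.
    keep_ratio: fraction of the token sequence length retained after compression.
    """
    b, n, d = tokens.shape
    n_out = max(2, int(n * keep_ratio))   # length of the compressed sequence
    k = n_out // 2 + 1                    # low-frequency bins needed for that length

    # Real FFT along the token dimension: (b, n, d) -> (b, n//2 + 1, d), complex.
    freq = torch.fft.rfft(tokens, dim=1)

    # Keep only the k lowest-frequency components, discarding high-frequency detail.
    freq_low = freq[:, :k, :]

    # Invert back to a shorter sequence of length n_out, so the LLM attends over
    # n_out tokens instead of n, cutting attention and FFN FLOPs roughly in proportion.
    return torch.fft.irfft(freq_low, n=n_out, dim=1)


# Example: 576 patch tokens (a typical CLIP ViT grid) compressed to 144 tokens.
vision_tokens = torch.randn(2, 576, 1024)
print(compress_vision_tokens(vision_tokens).shape)  # torch.Size([2, 144, 1024])
```

Token pruning methods such as AdaptInfer take a different route to the same goal: instead of transforming the sequence, they score tokens (e.g., by text-conditioned relevance) and drop the low-scoring ones at inference time.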

Sources

Fourier-VLM: Compressing Vision Tokens in the Frequency Domain for Large Vision-Language Models

AdaptInfer: Adaptive Token Pruning for Vision-Language Model Inference with Dynamical Text Guidance

Small-Large Collaboration: Training-efficient Concept Personalization for Large VLM using a Meta Personalized Small VLM

LLMC+: Benchmarking Vision-Language Model Compression with a Plug-and-play Toolkit
