Efficient Vision-Language Models

The field of vision-language models is moving toward more efficient, compressed architectures, with a focus on reducing computational overhead and inference latency. This is being achieved through methods such as frequency-domain compression, adaptive token pruning, and collaborative small-large model frameworks, which together are enabling more practical and widely deployable vision-language models. Notable papers include Fourier-VLM, which compresses vision tokens in the frequency domain and reduces inference FLOPs by up to 83.8% while achieving competitive performance and strong generalizability; AdaptInfer, which prunes vision tokens adaptively under dynamic text guidance, reducing CUDA latency by 61.3% while maintaining an average accuracy of 92.9% on vanilla LLaVA-1.5-7B; Small-Large Collaboration, which enables training-efficient concept personalization for large VLMs via a meta-personalized small VLM; and LLMC+, which provides a comprehensive plug-and-play benchmark for vision-language model compression.
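To make the frequency-domain idea concrete, the sketch below shows one way a vision token sequence could be shortened by keeping only its low-frequency components before it is passed to the language model. This is a minimal illustration of the general technique, not Fourier-VLM's actual method; the function name, tensor shapes, and keep ratio are assumptions chosen for the example.

```python
import torch


def compress_vision_tokens(tokens: torch.Tensor, keep_ratio: float = 0.25) -> torch.Tensor:
    """Toy frequency-domain compression of a vision token sequence.

    tokens:     (batch, num_tokens, hidden_dim) features from a vision encoder.
    keep_ratio: fraction of the token sequence length retained after compression.
    """
    b, n, d = tokens.shape
    n_out = max(2, int(n * keep_ratio))   # length of the compressed sequence
    k = n_out // 2 + 1                    # low-frequency bins needed for that length

    # Real FFT along the token dimension: (b, n, d) -> (b, n//2 + 1, d), complex.
    freq = torch.fft.rfft(tokens, dim=1)

    # Keep only the k lowest-frequency components, discarding high-frequency detail.
    freq_low = freq[:, :k, :]

    # Invert back to a shorter sequence of length n_out, so the LLM attends over
    # n_out tokens instead of n, cutting attention and FFN FLOPs roughly in proportion.
    return torch.fft.irfft(freq_low, n=n_out, dim=1)


# Example: 576 patch tokens (a typical CLIP ViT grid) compressed to 144 tokens.
vision_tokens = torch.randn(2, 576, 1024)
print(compress_vision_tokens(vision_tokens).shape)  # torch.Size([2, 144, 1024])
```

Token pruning methods such as AdaptInfer take a different route to the same goal: instead of transforming the sequence, they score tokens (e.g., by text-conditioned relevance) and drop the low-scoring ones at inference time.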

Sources

Fourier-VLM: Compressing Vision Tokens in the Frequency Domain for Large Vision-Language Models

AdaptInfer: Adaptive Token Pruning for Vision-Language Model Inference with Dynamical Text Guidance

Small-Large Collaboration: Training-efficient Concept Personalization for Large VLM using a Meta Personalized Small VLM

LLMC+: Benchmarking Vision-Language Model Compression with a Plug-and-play Toolkit
