Efficient Vision-Language Models for Autonomous Driving

The field of autonomous driving is shifting toward Vision-Language Models (VLMs) to strengthen perception and decision-making, but real-time deployment is hindered by high latency and computational overhead. Recent work addresses these limitations through early exiting, structured labeling, and token compression, reporting reductions in latency alongside gains in object detection accuracy. Noteworthy papers in this area include AD-EE, which proposes an early-exit framework that reduces latency by up to 57.58% and improves object detection accuracy by up to 44%; Structured Labeling Enables Faster Vision-Language Models for End-to-End Autonomous Driving, which introduces a compact 0.9B-parameter VLM baseline that achieves competitive performance on structured datasets; and Token Transforming: A Unified and Training-Free Token Compression Framework for Vision Transformer Acceleration, which proposes a many-to-many token transformation that preserves the most information and enables training-free acceleration, reducing FLOPs by 40% and speeding up DeiT-S by 1.5x with a marginal 0.1% accuracy drop.
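
To illustrate the early-exit idea mentioned above, the sketch below shows a generic encoder whose intermediate layers each feed a lightweight exit head; inference stops at the first layer whose prediction is confident enough, so easy inputs skip the remaining layers. This is a minimal, hedged sketch of the general pattern, not the AD-EE implementation; the class names, pooling choice, and confidence threshold are illustrative assumptions.

```python
# Minimal early-exit sketch (illustrative; not the AD-EE implementation).
# Each transformer layer is paired with a small exit head; we stop at the
# first layer whose softmax confidence clears a threshold, trading a small
# amount of accuracy for lower latency on easy inputs.
import torch
import torch.nn as nn


class EarlyExitEncoder(nn.Module):
    def __init__(self, dim=256, num_layers=6, num_classes=10, threshold=0.9):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
            for _ in range(num_layers)
        )
        # One lightweight classifier ("exit head") per layer.
        self.exit_heads = nn.ModuleList(
            nn.Linear(dim, num_classes) for _ in range(num_layers)
        )
        self.threshold = threshold  # assumed confidence cutoff for exiting

    @torch.no_grad()
    def forward(self, tokens):
        # tokens: (batch, seq_len, dim); batch size 1 keeps the exit logic simple.
        x = tokens
        for i, (layer, head) in enumerate(zip(self.layers, self.exit_heads)):
            x = layer(x)
            logits = head(x.mean(dim=1))       # pool tokens, then classify
            confidence, pred = logits.softmax(dim=-1).max(dim=-1)
            if confidence.item() >= self.threshold:
                return pred, i                 # confident: skip remaining layers
        return pred, len(self.layers) - 1      # fell through to the final layer


if __name__ == "__main__":
    model = EarlyExitEncoder().eval()
    dummy = torch.randn(1, 16, 256)            # one sequence of 16 tokens
    pred, exit_layer = model(dummy)
    print(f"predicted class {pred.item()} after layer {exit_layer}")
```

In practice the exit heads are trained jointly with the backbone and the threshold is tuned per deployment; the latency and accuracy figures reported above come from the cited papers, not from this sketch.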

Sources

AD-EE: Early Exiting for Fast and Reliable Vision-Language Models in Autonomous Driving

Structured Labeling Enables Faster Vision-Language Models for End-to-End Autonomous Driving

Token Transforming: A Unified and Training-Free Token Compression Framework for Vision Transformer Acceleration

Flexible Operator Fusion for Fast Sparse Transformer with Diverse Masking on GPU

Recipes for Pre-training LLMs with MXFP8

Give Me FP32 or Give Me Death? Challenges and Solutions for Reproducible Reasoning

FastFLUX: Pruning FLUX with Block-wise Replacement and Sandwich Training

EfficientVLA: Training-Free Acceleration and Compression for Vision-Language-Action Models
