Efficient Vision-Language Models for Autonomous Driving

The field of autonomous driving is shifting toward Vision-Language Models (VLMs) to strengthen perception and decision-making, but real-time deployment is hindered by high latency and computational overhead. Recent work addresses these limitations through early exiting, structured labeling, and token compression, reporting reductions in latency alongside gains in object detection accuracy. Noteworthy papers in this area include AD-EE, which proposes an early-exit framework that reduces latency by up to 57.58% and improves object detection accuracy by up to 44%; Structured Labeling Enables Faster Vision-Language Models for End-to-End Autonomous Driving, which introduces a compact 0.9B-parameter VLM baseline that achieves competitive performance on structured datasets; and Token Transforming: A Unified and Training-Free Token Compression Framework for Vision Transformer Acceleration, which proposes a many-to-many token transformation that preserves the most information and enables training-free acceleration, reducing FLOPs by 40% and speeding up DeiT-S by 1.5x with a marginal 0.1% accuracy drop.
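
To illustrate the early-exit idea mentioned above, the sketch below shows a generic encoder whose intermediate layers each feed a lightweight exit head; inference stops at the first layer whose prediction is confident enough, so easy inputs skip the remaining layers. This is a minimal, hedged sketch of the general pattern, not the AD-EE implementation; the class names, pooling choice, and confidence threshold are illustrative assumptions.

```python
# Minimal early-exit sketch (illustrative; not the AD-EE implementation).
# Each transformer layer is paired with a small exit head; we stop at the
# first layer whose softmax confidence clears a threshold, trading a small
# amount of accuracy for lower latency on easy inputs.
import torch
import torch.nn as nn


class EarlyExitEncoder(nn.Module):
    def __init__(self, dim=256, num_layers=6, num_classes=10, threshold=0.9):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
            for _ in range(num_layers)
        )
        # One lightweight classifier ("exit head") per layer.
        self.exit_heads = nn.ModuleList(
            nn.Linear(dim, num_classes) for _ in range(num_layers)
        )
        self.threshold = threshold  # assumed confidence cutoff for exiting

    @torch.no_grad()
    def forward(self, tokens):
        # tokens: (batch, seq_len, dim); batch size 1 keeps the exit logic simple.
        x = tokens
        for i, (layer, head) in enumerate(zip(self.layers, self.exit_heads)):
            x = layer(x)
            logits = head(x.mean(dim=1))       # pool tokens, then classify
            confidence, pred = logits.softmax(dim=-1).max(dim=-1)
            if confidence.item() >= self.threshold:
                return pred, i                 # confident: skip remaining layers
        return pred, len(self.layers) - 1      # fell through to the final layer


if __name__ == "__main__":
    model = EarlyExitEncoder().eval()
    dummy = torch.randn(1, 16, 256)            # one sequence of 16 tokens
    pred, exit_layer = model(dummy)
    print(f"predicted class {pred.item()} after layer {exit_layer}")
```

In practice the exit heads are trained jointly with the backbone and the threshold is tuned per deployment; the latency and accuracy figures reported above come from the cited papers, not from this sketch.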

Sources

AD-EE: Early Exiting for Fast and Reliable Vision-Language Models in Autonomous Driving

Structured Labeling Enables Faster Vision-Language Models for End-to-End Autonomous Driving

Token Transforming: A Unified and Training-Free Token Compression Framework for Vision Transformer Acceleration

Flexible Operator Fusion for Fast Sparse Transformer with Diverse Masking on GPU

Recipes for Pre-training LLMs with MXFP8

Give Me FP32 or Give Me Death? Challenges and Solutions for Reproducible Reasoning

FastFLUX: Pruning FLUX with Block-wise Replacement and Sandwich Training

EfficientVLA: Training-Free Acceleration and Compression for Vision-Language-Action Models
