The field of Vision-Language-Action (VLA) models is moving toward efficient, embodied intelligence, with a focus on reducing computational overhead and inference latency. Recent work has introduced techniques such as action-guided distillation, adaptive split computing, and progressive visual compression to enable real-time performance on resource-constrained devices.
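To make the visual-compression side of these techniques concrete, the sketch below shows one common way to shrink the visual token stream before it reaches the language model: rank patch tokens by a simple saliency proxy and keep only the top fraction. This is a minimal illustration, not the mechanism of any specific paper listed below; the function name, token-norm heuristic, and tensor shapes are assumptions.

```python
import torch

def compress_visual_tokens(tokens: torch.Tensor, keep_ratio: float = 0.33) -> torch.Tensor:
    """tokens: (batch, num_patches, dim) visual features from the vision encoder."""
    _, num_patches, dim = tokens.shape
    k = max(1, int(num_patches * keep_ratio))
    # Saliency proxy: L2 norm of each patch token (a simple, common heuristic).
    scores = tokens.norm(dim=-1)                    # (batch, num_patches)
    keep = scores.topk(k, dim=-1).indices           # (batch, k)
    keep, _ = keep.sort(dim=-1)                     # preserve spatial order
    idx = keep.unsqueeze(-1).expand(-1, -1, dim)    # (batch, k, dim)
    return tokens.gather(1, idx)

# Example: 576 patch tokens compressed to 190 (a >3x reduction in visual tokens).
feats = torch.randn(2, 576, 1024)
print(compress_visual_tokens(feats).shape)          # torch.Size([2, 190, 1024])
```

Fewer visual tokens shrinks the language model's prefill cost roughly in proportion, which is the basic lever behind the token-count and FLOP reductions reported below.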
Notable papers in this area include:
- ActDistill, which matches or surpasses full-scale VLA models while reducing computation by over 50% and delivering up to a 1.67x speedup.
- AVERY, which enables VLM deployment through adaptive split computing, achieving 11.2% higher accuracy than raw image compression and 93.98% lower energy consumption than full-edge execution (a sketch of the general split-and-compress pattern follows this list).
- Extreme Model Compression, which proposes two adaptive compression techniques, Sparse Temporal Token Fusion and Adaptive Neural Compression, that improve accuracy by up to 4.4% and reduce latency by up to 13x.
- Compressor-VLA, which achieves a competitive success rate on the LIBERO benchmark while reducing FLOPs by 59% and the visual token count by over 3x compared to its baseline.
- LLaVA-UHD v3, which demonstrates performance competitive with MoonViT while reducing time-to-first-token (TTFT) by 2.4x, and cuts TTFT by a further 1.9x when built within an identical MLLM architecture.
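For the split-computing approach mentioned above (AVERY), the following is a minimal sketch of the general split-and-compress pattern: an edge device runs the vision encoder, quantizes the intermediate features to int8 to shrink the payload, and a server dequantizes them and runs the rest of the model. The module names, split point, and quantization scheme are illustrative assumptions, and the adaptive part (choosing the split point or compression level from network and device conditions) is omitted.

```python
import torch
import torch.nn as nn

# Edge side: a small vision encoder producing patch features on-device.
edge_encoder = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=16, stride=16),  # 224x224 image -> 14x14 patches
    nn.ReLU(),
)

def edge_forward(image: torch.Tensor):
    """Encode on-device, then quantize features for cheap transmission."""
    feats = edge_encoder(image)                       # (1, 64, 14, 14)
    scale = feats.abs().max() / 127.0
    q = torch.clamp((feats / scale).round(), -127, 127).to(torch.int8)
    return q, scale                                   # int8 payload + scale factor

# Server side: dequantize and continue with a stand-in for the language model.
server_lm = nn.Linear(64, 32)

def server_forward(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    feats = q.float() * scale                         # dequantize
    tokens = feats.flatten(2).transpose(1, 2)         # (1, 196, 64) patch tokens
    return server_lm(tokens)

q, scale = edge_forward(torch.randn(1, 3, 224, 224))
out = server_forward(q, scale)
print(q.dtype, out.shape)                             # torch.int8 torch.Size([1, 196, 32])
```

In this toy setup the int8 features are roughly 4x smaller than their fp32 counterparts, which is the basic trade split computing makes: a little on-device encoding and compression in exchange for far less bandwidth and no full-model execution at the edge.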