The field of vision transformers is moving toward more efficient architectures, with a focus on token merging techniques that preserve spatial structure while reducing computational cost. Recent work has introduced novel merging strategies, including spatial-preserving token merging, progressive spatio-temporal token selection, and clustering-based token merging. These approaches achieve significant efficiency gains, including reduced FLOPs and higher FPS, while maintaining competitive accuracy across a range of vision tasks. Noteworthy papers include:

- CubistMerge: a simple yet effective token merging method that maintains spatial integrity, enabling seamless compatibility with spatial architectures.
- PSTTS: a plug-and-play module for event data that identifies and discards spatio-temporally redundant tokens, achieving a strong accuracy-efficiency trade-off.
- Token Merging via Spatiotemporal Information Mining for Surgical Video Understanding: a decoupled strategy that reduces token redundancy along the temporal and spatial dimensions independently, cutting GFLOPs by over 65%.
- ClustViT: extends the Vision Transformer backbone for semantic segmentation with a trainable Cluster module that merges similar tokens through the network, guided by pseudo-clusters derived from segmentation masks.

A minimal sketch of the core merging idea shared by these methods follows.
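For intuition, here is a self-contained sketch of generic similarity-based token merging: tokens are split into two alternating sets, each token in the first set is matched to its most similar partner in the second, and the `r` best-matched pairs are averaged together, shrinking the sequence from `N` to `N - r` tokens. This is a simplified illustration of the general technique, not a reproduction of any of the above papers' algorithms; the `merge_tokens` helper and its signature are hypothetical.

```python
import torch
import torch.nn.functional as F

def merge_tokens(x: torch.Tensor, r: int) -> torch.Tensor:
    """Hypothetical sketch: merge the r most similar token pairs by averaging.

    x: token sequence of shape (B, N, C); returns (B, N - r, C).
    """
    B, N, C = x.shape
    # Alternating bipartite split, so every merge pairs a token from `a`
    # with a token from `b`.
    a, b = x[:, ::2, :], x[:, 1::2, :]
    sim = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).transpose(-1, -2)

    # For each token in `a`, find its most similar partner in `b`,
    # then keep only the r highest-similarity pairs for merging.
    best_sim, best_idx = sim.max(dim=-1)
    order = best_sim.argsort(dim=-1, descending=True)
    merge_src = order[:, :r]    # tokens in `a` merged away
    keep_src = order[:, r:]     # tokens in `a` kept unchanged

    # Average each merged a-token into its matched b-token; scatter_reduce
    # with reduce="mean" also handles several a-tokens mapping to one b-token.
    dst_idx = best_idx.gather(1, merge_src)
    src_tok = a.gather(1, merge_src.unsqueeze(-1).expand(-1, -1, C))
    b = b.scatter_reduce(1, dst_idx.unsqueeze(-1).expand(-1, -1, C),
                         src_tok, reduce="mean", include_self=True)

    kept_a = a.gather(1, keep_src.unsqueeze(-1).expand(-1, -1, C))
    return torch.cat([kept_a, b], dim=1)  # (B, N - r, C)

# Usage: reduce a ViT-style sequence of 196 patch tokens by 32 per layer.
tokens = torch.randn(2, 196, 768)
merged = merge_tokens(tokens, r=32)  # shape (2, 164, 768)
```

Note that this sketch merges by feature similarity alone, so the surviving tokens no longer form a regular 2D grid; spatial-preserving methods such as CubistMerge additionally constrain which tokens may merge so that the output remains compatible with architectures that assume grid structure.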