Efficient Vision Transformers

The field of computer vision is moving toward more efficient and scalable architectures, particularly Vision Transformers (ViTs). Researchers are exploring ways to reduce the computational and memory demands of ViTs so they can be deployed on resource-constrained platforms. One direction is the design of novel feedforward networks, such as cascaded chunk-feedforward networks, which improve parameter and FLOP efficiency without sacrificing accuracy. Another is asymmetric aggregation, which enables more effective feature-to-cluster assignment and improves matching accuracy in visual place recognition (a rough sketch of the idea follows the paper list below). There is also growing interest in unifying convolution and self-attention within a single framework, which offers a more principled and interpretable approach to vision architecture design. Noteworthy papers in this area include:

  • Stratified Knowledge-Density Super-Network for Scalable Vision Transformers, which proposes a method for transforming a pre-trained ViT into a stratified knowledge-density super-network, enabling flexible extraction of sub-networks that retain maximal knowledge for varying model sizes.
  • CascadedViT: Cascaded Chunk-FeedForward and Cascaded Group Attention Vision Transformer, which introduces a lightweight, compute-efficient vision transformer built around a cascaded chunk-feedforward network and cascaded group attention (see the first sketch after this list).
  • Attention Via Convolutional Nearest Neighbors, which unifies convolution and self-attention within a single k-nearest-neighbor aggregation framework (second sketch below).
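
To make the chunking idea concrete, below is a minimal PyTorch sketch of a cascaded chunk-feedforward layer. The specific scheme (splitting channels into chunks and carrying each chunk's output into the next, Res2Net-style) is an assumption for illustration and may differ from CascadedViT's actual design; CascadedChunkFFN and its parameters are hypothetical names.

```python
# Minimal, illustrative sketch of a cascaded chunk-feedforward layer.
# The chunking/cascading scheme here is an assumption, not the paper's
# reference implementation.
import torch
import torch.nn as nn


class CascadedChunkFFN(nn.Module):
    def __init__(self, dim: int, num_chunks: int = 4, expansion: int = 4):
        super().__init__()
        assert dim % num_chunks == 0, "dim must divide evenly into chunks"
        self.num_chunks = num_chunks
        chunk_dim = dim // num_chunks
        # One narrow FFN per chunk instead of a single wide FFN over all
        # channels: together they use ~1/num_chunks of the parameters/FLOPs.
        self.ffns = nn.ModuleList(
            nn.Sequential(
                nn.Linear(chunk_dim, chunk_dim * expansion),
                nn.GELU(),
                nn.Linear(chunk_dim * expansion, chunk_dim),
            )
            for _ in range(num_chunks)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Cascade: each chunk's output is carried into the next chunk's
        # input so information still mixes across the channel dimension.
        chunks = x.chunk(self.num_chunks, dim=-1)
        outputs, carry = [], 0
        for chunk, ffn in zip(chunks, self.ffns):
            out = ffn(chunk + carry)
            outputs.append(out)
            carry = out
        return torch.cat(outputs, dim=-1)


x = torch.randn(2, 196, 256)           # (batch, tokens, dim)
print(CascadedChunkFFN(256)(x).shape)  # torch.Size([2, 196, 256])
```

With this layout, the per-chunk FFNs together use roughly 1/num_chunks of the parameters and FLOPs of a single full-width FFN at the same expansion ratio, which is where the efficiency gain comes from.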
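The nearest-neighbor unification can likewise be sketched in a few lines. Under this reading, convolution and self-attention differ only in how each token's k neighbors are chosen and weighted: fixed spatial offsets with learned weights give convolution, while feature-similarity top-k with softmax weights gives attention. The function below (knn_aggregate, a hypothetical name) shows the attention-like case; it is an illustrative reading, not the paper's reference implementation.

```python
# Sketch of the "aggregate over k nearest neighbors" view that unifies
# convolution and self-attention: only neighbor selection changes.
import torch
import torch.nn.functional as F


def knn_aggregate(q, k, v, num_neighbors: int):
    """For each query token, aggregate values from its k most similar keys."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5  # (N, N) similarities
    topk_scores, topk_idx = scores.topk(num_neighbors, dim=-1)
    weights = F.softmax(topk_scores, dim=-1)                 # (N, k)
    neighbors = v[topk_idx]                                  # (N, k, d)
    return (weights.unsqueeze(-1) * neighbors).sum(dim=-2)   # (N, d)


tokens = torch.randn(196, 64)
out = knn_aggregate(tokens, tokens, tokens, num_neighbors=9)
print(out.shape)  # torch.Size([196, 64])
```

Setting num_neighbors to the full sequence length recovers standard softmax self-attention; shrinking it yields a sparse, convolution-like receptive field.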
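Finally, a rough sketch of asymmetric feature-to-cluster assignment for locally aggregated descriptors, the visual place recognition direction mentioned above. The asymmetry modeled here, separate learned projections for features and cluster centers before computing soft assignments, is an assumption for illustration; A²GC's exact formulation and its geometric constraints may differ, and AsymmetricVLAD is a hypothetical name.

```python
# Illustrative NetVLAD-style aggregation with an *asymmetric* assignment:
# features and cluster centers pass through different learned projections.
# This is an assumed reading of the idea, not A^2GC's actual method.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AsymmetricVLAD(nn.Module):
    def __init__(self, dim: int, num_clusters: int):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_clusters, dim))
        self.proj_feat = nn.Linear(dim, dim)    # projects local features
        self.proj_center = nn.Linear(dim, dim)  # projects cluster centers

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (num_local_features, dim) descriptors for one image.
        sim = self.proj_feat(feats) @ self.proj_center(self.centers).T
        assign = F.softmax(sim, dim=-1)                   # soft assignment (N, K)
        residuals = feats.unsqueeze(1) - self.centers     # (N, K, dim)
        vlad = (assign.unsqueeze(-1) * residuals).sum(0)  # (K, dim)
        return F.normalize(vlad.flatten(), dim=0)         # global descriptor


desc = AsymmetricVLAD(dim=128, num_clusters=16)(torch.randn(500, 128))
print(desc.shape)  # torch.Size([2048])
```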

Sources

Stratified Knowledge-Density Super-Network for Scalable Vision Transformers

LE-CapsNet: A Light and Enhanced Capsule Network

A²GC: Asymmetric Aggregation with Geometric Constraints for Locally Aggregated Descriptors

CascadedViT: Cascaded Chunk-FeedForward and Cascaded Group Attention Vision Transformer

Attention Via Convolutional Nearest Neighbors

ODE-ViT: Plug & Play Attention Layer from the Generalization of the ViT as an Ordinary Differential Equation
