Efficient Deployment of Vision Transformers

The field of vision transformers is moving towards efficient deployment, with a focus on reducing computational complexity and memory demands. Researchers are exploring innovative methods to accelerate model inference, including exploiting information redundancy in attention maps and optimizing neural networks with learnable non-linear activation functions. These advancements have the potential to enable the deployment of vision transformers at the edge, where energy efficiency and low latency are crucial. Notable papers in this area include:

  • One that proposes Entropy Attention Maps (EAM), exploiting information redundancy in attention maps to maintain similar or higher accuracy at low sparsity (an illustrative entropy-scoring sketch follows this list).
  • Another that presents a reconfigurable lookup architecture for edge FPGAs, replacing energy-intensive arithmetic with table lookups while preserving activation fidelity (see the lookup-table sketch after the list).
  • A third that designs a lossless compression method to reduce the storage and transmission costs of neural network weights, checkpoints, and K/V caches in low-precision formats (a stream-splitting sketch follows the list).
  • A fourth that proposes a low-bit, model-specialized systolic-array accelerator for deploying vision transformers, achieving high accuracy and power efficiency across benchmarks.
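
The entropy-based idea can be illustrated with a minimal sketch: score each attention head by the Shannon entropy of its attention maps and flag low-entropy (highly redundant) heads as candidates for aggressive quantization or pruning. This is not the paper's exact EAM algorithm; the entropy threshold, tensor shapes, and the decision to score whole heads are assumptions made for illustration.

```python
# Illustrative sketch (not the paper's exact EAM method): low-entropy attention
# maps carry little information and are candidates for extreme quantization.
import numpy as np

def head_entropy(attn: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """attn: (heads, queries, keys) softmax attention maps, rows sum to 1.
    Returns the mean per-row Shannon entropy (in bits) for each head."""
    row_entropy = -np.sum(attn * np.log2(attn + eps), axis=-1)  # (heads, queries)
    return row_entropy.mean(axis=-1)                            # (heads,)

def select_redundant_heads(attn: np.ndarray, threshold_bits: float = 1.0) -> np.ndarray:
    """Mark heads whose maps carry little information; a deployment pipeline
    could prune them or quantize them to very few bits."""
    return head_entropy(attn) < threshold_bits

# Toy usage: 4 heads, 8 queries, 8 keys of softmax-normalized random attention.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 8, 8))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
print(select_redundant_heads(attn))
```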
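
The lookup-based activation approach can be sketched in software as follows: the non-linear activation is precomputed into a table, so inference only performs a quantize-and-index operation instead of evaluating the function. The 8-bit uniform quantizer, input range, and tanh-approximated GELU below are illustrative assumptions, not the paper's FPGA design.

```python
# Minimal sketch, assuming a uniform 8-bit input quantizer and a fixed GELU:
# the activation is tabulated offline and read back by index at inference.
import numpy as np

def build_lut(fn, x_min=-8.0, x_max=8.0, bits=8):
    """Precompute fn over 2**bits uniformly spaced points in [x_min, x_max]."""
    grid = np.linspace(x_min, x_max, 2 ** bits, dtype=np.float32)
    return fn(grid), x_min, x_max, bits

def lut_activation(x, lut, x_min, x_max, bits):
    """Quantize x to a table index and read the stored activation value."""
    scale = (2 ** bits - 1) / (x_max - x_min)
    idx = np.clip(np.round((x - x_min) * scale), 0, 2 ** bits - 1).astype(np.int32)
    return lut[idx]

gelu = lambda x: 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))
lut, lo, hi, b = build_lut(gelu)
x = np.float32([-2.0, -0.5, 0.0, 1.5, 3.0])
print(lut_activation(x, lut, lo, hi, b))  # close to gelu(x) up to quantization error
```

On hardware, the same idea maps a quantized activation code directly to a pre-stored output word, so no multipliers or transcendental evaluations are needed at inference time.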
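
The lossless-compression direction can be illustrated with a small experiment: separating the bytes of low-precision weights into more uniform streams often helps a generic entropy coder. The bfloat16 truncation, the high/low byte split, and the use of zlib below are assumptions for the sketch, not the paper's codec.

```python
# Hedged sketch, not the paper's method: split bfloat16 weights into an
# exponent-heavy byte stream and a mantissa byte stream before lossless coding.
import numpy as np
import zlib

def compress_bf16(weights_fp32: np.ndarray) -> tuple[int, int]:
    """Truncate float32 weights to bfloat16, then compare compressing the raw
    interleaved bytes against compressing the split high/low byte streams."""
    as_u32 = weights_fp32.astype(np.float32).view(np.uint32)
    bf16 = (as_u32 >> 16).astype(np.uint16)        # top 16 bits = bfloat16
    raw = bf16.tobytes()
    hi = (bf16 >> 8).astype(np.uint8).tobytes()    # sign + exponent-heavy byte
    lo = (bf16 & 0xFF).astype(np.uint8).tobytes()  # mantissa byte
    return len(zlib.compress(raw)), len(zlib.compress(hi)) + len(zlib.compress(lo))

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=100_000).astype(np.float32)  # typical weight scale
interleaved, split = compress_bf16(w)
print(f"interleaved: {interleaved} bytes, split streams: {split} bytes")
```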

Sources

Exploiting Information Redundancy in Attention Maps for Extreme Quantization of Vision Transformers

Optimizing Neural Networks with Learnable Non-Linear Activation Functions via Lookup-Based FPGA Acceleration

Lossless Compression of Neural Network Components: Weights, Checkpoints, and K/V Caches in Low-Precision Formats

Systolic Array-based Architecture for Low-Bit Integerized Vision Transformers
