Efficient Neural Network Architectures

The field of neural networks is moving toward more efficient architectures, with the goal of reducing computational cost while preserving accuracy. Researchers are exploring complementary techniques such as mixed-precision quantization, cheaper and sparser attention mechanisms, and pruning. Notable papers in this area include MixA-Q, which proposes a mixed-precision activation quantization framework for efficient inference of quantized vision transformers, and EA-ViT, which introduces an efficient adaptation framework for elastic vision transformers. EcoTransformer and TriangleMix target the attention layer itself: the former replaces multiplications in attention with cheaper operations, while the latter introduces a sparse attention pattern that reduces the cost of long-context prefilling. LinDeps demonstrates a fine-tuning-free post-pruning method that removes layer-wise linear dependencies with guaranteed performance preservation, and MOR-VIT improves vision-transformer efficiency with a mixture-of-recursions mechanism. Overall, the field is shifting toward efficient, scalable architectures aimed at practical deployment in real-world applications.
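To make the quantization idea concrete, below is a minimal PyTorch sketch of mixed-precision activation quantization in the spirit of MixA-Q: tokens with larger activation magnitude keep higher precision while the rest are quantized more aggressively. The function names, the magnitude-based importance score, and the bit-width choices are illustrative assumptions, not MixA-Q's actual method.

```python
import torch

def quantize_activations(x: torch.Tensor, bits: int) -> torch.Tensor:
    """Uniform symmetric fake-quantization of activations to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().amax().clamp(min=1e-8) / qmax
    return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale

def mixed_precision_quantize(x, high_bits=8, low_bits=4, keep_ratio=0.5):
    """Sketch of token-level mixed precision (illustrative heuristic).

    Scores each token (row) by mean activation magnitude; the top
    `keep_ratio` fraction stays at `high_bits`, the rest drop to `low_bits`.
    """
    scores = x.abs().mean(dim=-1)                 # one score per token
    k = max(1, int(keep_ratio * x.shape[0]))
    topk = torch.topk(scores, k).indices
    out = quantize_activations(x, low_bits)       # default: low precision
    out[topk] = quantize_activations(x[topk], high_bits)
    return out

if __name__ == "__main__":
    x = torch.randn(196, 768)                     # e.g. ViT patch tokens
    xq = mixed_precision_quantize(x)
    print((x - xq).abs().mean())                  # mean quantization error
```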
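Similarly, a sparse attention pattern for long-context prefilling can be expressed as a boolean attention mask. The sketch below builds a generic causal mask combining a local sliding window with a few always-visible "sink" tokens; this is a common sparsity pattern in the same spirit as TriangleMix, not the paper's specific triangle-shaped pattern, and the `window` and `sink` parameters are illustrative.

```python
import torch

def sparse_prefill_mask(seq_len: int, window: int, sink: int) -> torch.Tensor:
    """Boolean attention mask (True = attend); illustrative, not TriangleMix.

    Combines a causal local window over recent tokens with a handful of
    global sink tokens that every query may attend to.
    """
    i = torch.arange(seq_len).unsqueeze(1)   # query positions (column vector)
    j = torch.arange(seq_len).unsqueeze(0)   # key positions (row vector)
    causal = j <= i                          # no attending to the future
    local = (i - j) < window                 # recent tokens within the window
    sinks = j < sink                         # first few tokens always visible
    return causal & (local | sinks)

if __name__ == "__main__":
    print(sparse_prefill_mask(8, window=3, sink=2).int())
```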

Sources

MixA-Q: Revisiting Activation Sparsity for Vision Transformers from a Mixed-Precision Quantization Perspective

EA-ViT: Efficient Adaptation for Elastic Vision Transformer

Efficient Attention Mechanisms for Large Language Models: A Survey

Smaller, Faster, Cheaper: Architectural Designs for Efficient Machine Learning

EcoTransformer: Attention without Multiplication

Demystifying the 7-D Convolution Loop Nest for Data and Instruction Streaming in Reconfigurable AI Accelerators

Transformers as Unrolled Inference in Probabilistic Laplacian Eigenmaps: An Interpretation and Potential Improvements

TriangleMix: A Lossless and Efficient Attention Pattern for Long Context Prefilling

LinDeps: A Fine-tuning Free Post-Pruning Method to Remove Layer-Wise Linear Dependencies with Guaranteed Performance Preservation

MOR-VIT: Efficient Vision Transformer with Mixture-of-Recursions

MSQ: Memory-Efficient Bit Sparsification Quantization

FGFP: A Fractional Gaussian Filter and Pruning for Deep Neural Networks Compression

Pre-trained Models Perform the Best When Token Distributions Follow Zipf's Law
