Quantization and Efficient Deployment of Neural Networks

Research on neural networks is increasingly focused on quantization and efficient deployment. Recent work aims to improve the accuracy and speed of quantized networks while reducing their memory and compute requirements. Notably, researchers have proposed novel quantization schemes, such as adaptive distribution-aware quantization and bit-shifting quantization, which achieve high accuracy with low-precision weights and activations. Hardware-software co-design approaches have also been explored to optimize deployment on resource-constrained devices. Overall, the field is moving towards more efficient and scalable neural network deployment, enabling wider adoption in real-world applications.
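
To make the idea of low-precision weights and activations concrete, here is a minimal sketch of symmetric uniform quantization with an optional power-of-two scale, so that rescaling can in principle be implemented with bit-shifts (in the spirit of bit-shifting quantization). The function names, the 4-bit setting, and the NumPy implementation are illustrative assumptions, not code from any of the cited papers.

```python
import numpy as np

def quantize_symmetric(w, num_bits=4, power_of_two_scale=False):
    """Symmetric uniform quantization of a tensor.

    If power_of_two_scale is True, the scale is rounded to the nearest
    power of two, so dequantization amounts to a bit-shift in fixed-point
    hardware (illustrative of bit-shifting quantization schemes).
    """
    qmax = 2 ** (num_bits - 1) - 1               # e.g. 7 for signed 4-bit
    scale = np.max(np.abs(w)) / qmax + 1e-12     # small eps avoids divide-by-zero
    if power_of_two_scale:
        scale = 2.0 ** np.round(np.log2(scale))
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map integer codes back to the real-valued domain."""
    return q.astype(np.float32) * scale

# Toy usage: quantize a weight matrix and measure the reconstruction error.
w = np.random.randn(64, 64).astype(np.float32)
q, s = quantize_symmetric(w, num_bits=4, power_of_two_scale=True)
w_hat = dequantize(q, s)
print("mean abs error:", np.mean(np.abs(w - w_hat)))
```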

Noteworthy papers include SpikeFit, which introduces a novel training method for Spiking Neural Networks that enables efficient inference on neuromorphic hardware; MetaCluster, which proposes a framework for deep compression of Kolmogorov-Arnold Networks without sacrificing accuracy; and AccuQuant, which presents a post-training quantization method for diffusion models that simulates multiple denoising steps to alleviate error accumulation.
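
To illustrate why error accumulation matters when quantizing iterative samplers such as diffusion models, the sketch below runs a toy denoising loop once in full precision and once with fake-quantized activations, and prints how the discrepancy between the two trajectories builds up across steps. The toy update rule, the 4-bit fake quantizer, and the 8-step horizon are assumptions made for illustration; this is not AccuQuant's actual calibration procedure.

```python
import numpy as np

def fake_quant(x, num_bits=4):
    """Simulated (fake) quantization: quantize then dequantize activations."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax + 1e-12
    return np.round(np.clip(x / scale, -qmax - 1, qmax)) * scale

def denoise_step(x, t):
    """Toy 'denoiser': a contraction that slowly pulls x towards zero."""
    return 0.9 * x + 0.05 * np.tanh(x) / (t + 1)

rng = np.random.default_rng(0)
x_fp = rng.normal(size=1024)   # full-precision trajectory
x_q = x_fp.copy()              # trajectory with quantized activations

for t in range(8):             # simulate several denoising steps
    x_fp = denoise_step(x_fp, t)
    x_q = fake_quant(denoise_step(x_q, t))
    # The gap at step t reflects rounding errors from all previous steps,
    # not just the current one.
    print(f"step {t}: trajectory gap = {np.mean(np.abs(x_fp - x_q)):.5f}")
```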

Sources

On the Generalization Properties of Learning the Random Feature Models with Learnable Activation Functions

SpikeFit: Towards Optimal Deployment of Spiking Networks on Neuromorphic Hardware

Optimization of the quantization of dense neural networks from an exact QUBO formulation

Differentiable, Bit-shifting, and Scalable Quantization without training neural network from scratch

One-Bit Quantization for Random Features Models

SOLE: Hardware-Software Co-design of Softmax and LayerNorm for Efficient Transformer Inference

Bitwidth-Specific Logarithmic Arithmetic for Future Hardware-Accelerated Training

MetaCluster: Enabling Deep Compression of Kolmogorov-Arnold Network

Energy-Efficient and Dequantization-Free Q-LLMs: A Spiking Neural Network Approach to Salient Value Mitigation

Adaptive Distribution-aware Quantization for Mixed-Precision Neural Networks

AccuQuant: Simulating Multiple Denoising Steps for Quantizing Diffusion Models

Efficient Multi-bit Quantization Network Training via Weight Bias Correction and Bit-wise Coreset Sampling
