Quantization Techniques for Efficient Deep Learning

Deep learning research is increasingly focused on models efficient enough to run on resource-constrained devices. A major line of recent work is quantization, which reduces the memory footprint and computational cost of deep neural networks by representing weights (and sometimes activations) with low-bit values instead of full-precision floating point, while preserving as much accuracy as possible. This makes large models practical to deploy on edge hardware. Notable directions include learnable quantization, layer-wise ultra-low bit quantization, and cross-layer guided orthogonal-based quantization, all of which report lower memory consumption and faster inference.
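As a concrete reference point, the sketch below shows plain symmetric uniform quantization of a weight tensor, the baseline that the methods surveyed here improve on. It is a minimal NumPy illustration under simple assumptions (per-tensor scale, signed integer range), not taken from any of the cited papers.

```python
# Minimal sketch of symmetric uniform weight quantization (illustrative only;
# not the method of any specific paper cited in this digest).
import numpy as np

def quantize_symmetric(w: np.ndarray, num_bits: int = 8):
    """Map w to signed integer codes in [-(2^(b-1)-1), 2^(b-1)-1]."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax          # per-tensor scale (assumption)
    scale = max(float(scale), 1e-12)          # guard against all-zero weights
    dtype = np.int8 if num_bits <= 8 else np.int32
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(dtype)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map integer codes back to floating point."""
    return q.astype(np.float32) * scale

# Example: fp32 -> int8 cuts weight memory by 4x at the cost of a small error.
w = np.random.randn(256, 256).astype(np.float32)
q, s = quantize_symmetric(w, num_bits=8)
w_hat = dequantize(q, s)
print("mean abs quantization error:", np.mean(np.abs(w - w_hat)))
```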

Several papers in this area stand out. LUQ proposes a layerwise ultra-low bit quantization strategy for multimodal large language models, using 40% and 31% less memory than 4-bit counterparts. CLQ introduces a cross-layer guided orthogonal-based quantization method for diffusion transformers, reporting 3.98x memory savings and a 3.95x speedup. RSAVQ proposes a Riemannian sensitivity-aware vector quantization framework for large language models that outperforms existing methods at 2-bit quantization.
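To make the layer-wise idea concrete, the following is a hypothetical sketch of post-training weight quantization with per-layer bit-widths and per-output-channel scales. The function names, default bit-width, and example bit plan are assumptions for illustration; this does not reproduce the LUQ, CLQ, or RSAVQ algorithms.

```python
# Generic illustration of layer-wise low-bit weight quantization.
# NOT the LUQ, CLQ, or RSAVQ method; all names and bit assignments are assumptions.
import numpy as np

def quantize_layer(w: np.ndarray, num_bits: int):
    """Per-output-channel symmetric quantization of a 2-D weight matrix (num_bits <= 8)."""
    qmax = 2 ** (num_bits - 1) - 1
    scales = np.maximum(np.max(np.abs(w), axis=1, keepdims=True) / qmax, 1e-12)
    q = np.clip(np.round(w / scales), -qmax, qmax).astype(np.int8)
    return q, scales

def quantize_model(weights: dict, bit_plan: dict):
    """Apply a per-layer bit-width plan (e.g. sensitive layers keep more bits)."""
    quantized = {}
    for name, w in weights.items():
        bits = bit_plan.get(name, 4)  # hypothetical default of 4 bits
        quantized[name] = quantize_layer(w, bits)
    return quantized

# Usage: give a (hypothetically) more sensitive layer more bits than the rest.
weights = {"attn.qkv": np.random.randn(768, 768), "attn.out": np.random.randn(768, 768)}
bit_plan = {"attn.qkv": 2, "attn.out": 4}
quantized = quantize_model(weights, bit_plan)
```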

Sources

$\gamma$-Quant: Towards Learnable Quantization for Low-bit Pattern Recognition

LUQ: Layerwise Ultra-Low Bit Quantization for Multimodal Large Language Models

CLQ: Cross-Layer Guided Orthogonal-based Quantization for Diffusion Transformers

Norm-Q: Effective Compression Method for Hidden Markov Models in Neuro-Symbolic Applications

Cat: Post-Training Quantization Error Reduction via Cluster-Based Affine Transformation

Post-Training Quantization via Residual Truncation and Zero Suppression for Diffusion Models

RSAVQ: Riemannian Sensitivity-Aware Vector Quantization for Large Language Models
