The field of deep learning is moving toward more efficient models that can run on resource-constrained edge devices. Recent research has focused on quantization techniques that reduce the memory footprint and computational cost of deep neural networks while preserving their accuracy. Notable advances include learnable quantization methods, layer-wise ultra-low-bit quantization, and cross-layer guided orthogonal-based quantization, all of which have shown promising reductions in memory consumption and improvements in inference speed.
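For reference, the sketch below shows plain per-tensor symmetric uniform quantization in PyTorch: weights are mapped to a small set of integer levels plus a scale, which is the basic memory-saving idea the papers below refine. It is a generic baseline, not the method of any paper cited here, and names such as `quantize_symmetric` are illustrative.

```python
import torch

def quantize_symmetric(w: torch.Tensor, num_bits: int = 4):
    """Per-tensor symmetric uniform quantization of a weight tensor.

    Returns the integer codes and the scale needed to dequantize.
    """
    qmax = 2 ** (num_bits - 1) - 1          # e.g. 7 for 4-bit signed codes
    scale = w.abs().max() / qmax            # map the largest magnitude to qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Recover an approximation of the original weights from codes and scale.
    return q.to(torch.float32) * scale

# Example: quantize a random weight matrix to 4 bits and measure the error.
w = torch.randn(256, 256)
q, scale = quantize_symmetric(w, num_bits=4)
w_hat = dequantize(q, scale)
print(f"mean abs error: {(w - w_hat).abs().mean().item():.4f}")
```

The methods surveyed here improve on this baseline by learning the quantization parameters, varying the bit-width per layer, or rotating weights into a basis that is easier to quantize.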
Noteworthy papers in this area include LUQ, which proposes a novel quantization strategy for multimodal large language models that uses 40% and 31% less memory than its 4-bit counterparts; CLQ, which introduces a cross-layer guided orthogonal-based quantization method for diffusion transformers, achieving a 3.98x memory saving and a 3.95x speedup; and RSAVQ, which proposes a Riemannian sensitivity-aware vector quantization framework for large language models that outperforms existing methods at 2-bit quantization.
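As a rough sanity check on ratios like the 3.98x memory saving above, weight storage scales with bit-width, so moving from 16-bit to 4-bit weights gives close to a 4x reduction before the small overhead of scales and other quantization metadata. The figures below are illustrative only, not taken from the cited papers.

```python
def weight_memory_mib(num_params: int, bits_per_weight: float) -> float:
    """Approximate weight storage in MiB for a given bit-width."""
    return num_params * bits_per_weight / 8 / 2**20

# Hypothetical 1B-parameter model: FP16 vs. 4-bit weights
# (ignores per-group scales/zero-points, which add a small overhead).
n = 1_000_000_000
fp16 = weight_memory_mib(n, 16)
int4 = weight_memory_mib(n, 4)
print(f"FP16: {fp16:.0f} MiB, 4-bit: {int4:.0f} MiB, ratio: {fp16 / int4:.2f}x")
```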