Advances in Large Language Model Quantization

The field of large language models (LLMs) is moving towards more efficient, compressed models, with a particular focus on post-training quantization. Recent research has explored several approaches to reducing quantization error and preserving model quality. One direction is null space optimization, which constrains the post-quantization weight perturbation to lie within the null space of the input activations, so that the perturbation leaves layer outputs on the calibration inputs unchanged. Another is adaptive mixed-precision delta-compression, which allocates bit-widths to minimize quantization error and achieves better performance at high compression ratios. Ultra low-bit quantization methods have also been proposed that factorize weights into low-rank latent matrices and binarize those factors. Finally, module-wise weight decay techniques have been developed to account for the structural diversity across LLM modules and improve performance. Noteworthy papers include:

  • Boost Post-Training Quantization via Null Space Optimization for Large Language Models, which introduces the concept of the null space into LLM quantization and proposes a plug-and-play null space projection module (a minimal sketch of the underlying idea follows this list).
  • ADAMIX: Adaptive Mixed-Precision Delta-Compression with Quantization Error Optimization for Large Language Models, which provides a mathematical derivation of the quantization error and formulates the optimal mixed-precision bit allocation scheme.
  • LittleBit: Ultra Low-Bit Quantization via Latent Factorization, which represents weights through binarized low-rank latent factors, achieving nearly 31x memory reduction and a superior size-performance trade-off (a rough factorization sketch also follows this list).
  • AlphaDecay: Module-wise Weight Decay for Heavy-Tailed Balancing in LLMs, which adaptively assigns a different weight decay strength to each module of an LLM and improves performance.
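
The null space idea can be made concrete with a short sketch. The NumPy code below is an illustration under assumptions, not the paper's implementation: it builds an orthogonal projector onto the null space of a calibration activation matrix X and shows that any weight perturbation projected into that null space leaves the layer's outputs on X numerically unchanged. The layer shapes, the toy uniform quantizer, and the rank tolerance are all illustrative choices.

```python
# Minimal sketch of null-space-constrained weight perturbation (illustrative only).
import numpy as np

def null_space_projector(X: np.ndarray, tol: float = 1e-6) -> np.ndarray:
    """Orthogonal projector onto the null space of the calibration matrix X (n x d_in)."""
    # Right singular vectors with non-negligible singular values span the row space of X;
    # everything orthogonal to them lies in the null space.
    _, s, vt = np.linalg.svd(X, full_matrices=False)
    rank = int((s > tol * s[0]).sum())
    v_row = vt[:rank].T                                  # d_in x rank orthonormal row-space basis
    return np.eye(X.shape[1]) - v_row @ v_row.T          # I - V V^T

rng = np.random.default_rng(0)
n, d_in, d_out = 48, 64, 32                              # n < d_in, so a non-trivial null space exists
X = rng.standard_normal((n, d_in))                       # calibration activations (rows are samples)
W = rng.standard_normal((d_out, d_in))                   # original layer weights, y = W x
W_q = np.round(W * 4) / 4                                # toy uniform quantizer, step 0.25 (assumption)
delta = W_q - W                                          # post-quantization weight perturbation

P = null_space_projector(X)
delta_null = delta @ P                                   # component of the perturbation inside the null space

# A perturbation confined to the null space does not change the layer output on the calibration data.
print("raw perturbation effect:       ", np.abs(X @ delta.T).max())
print("null-space perturbation effect:", np.abs(X @ delta_null.T).max())   # ~0
```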

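In the same spirit, the latent-factorization route described for LittleBit can be sketched as: factor a weight matrix into two low-rank latent factors, keep only the sign of each factor entry plus one floating-point scale per column, and reconstruct from the binarized factors. The code below is a rough illustration under these assumptions (a truncated SVD and naive post-hoc binarization rather than the paper's procedure); its point is the storage layout, which lands well below one bit per weight, not the reconstruction quality.

```python
# Rough sketch of binarized low-rank latent factors (illustrative only).
import numpy as np

def binarize_columns(m: np.ndarray):
    """Sign/scale binarization: column j is approximated by scale[j] * sign(m[:, j])."""
    return np.sign(m), np.abs(m).mean(axis=0)

rng = np.random.default_rng(0)
d, rank = 256, 32
low_rank = rng.standard_normal((d, rank)) @ rng.standard_normal((rank, d)) / np.sqrt(rank)
W = low_rank + 0.05 * rng.standard_normal((d, d))        # toy weight with approximate low-rank structure

# Low-rank latent factors (obtained here by truncated SVD; a real method would fit or train them).
U, s, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :rank] * np.sqrt(s[:rank])                      # d x rank
B = Vt[:rank].T * np.sqrt(s[:rank])                      # d x rank, so W ~= A @ B.T

# Binarize each factor: ~1 bit per factor entry plus a few FP16 scales.
A_sign, a_scale = binarize_columns(A)
B_sign, b_scale = binarize_columns(B)
W_hat = (A_sign * a_scale) @ (B_sign * b_scale).T

# Naive post-hoc binarization is crude; the interesting quantity is the bit budget.
rel_err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
bits = (A_sign.size + B_sign.size) + 16 * (a_scale.size + b_scale.size)
print(f"relative reconstruction error: {rel_err:.3f}")
print(f"effective bits per weight: {bits / W.size:.2f} (vs. 16 for FP16)")
```
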
Sources

Boost Post-Training Quantization via Null Space Optimization for Large Language Models

ADAMIX: Adaptive Mixed-Precision Delta-Compression with Quantization Error Optimization for Large Language Models

LittleBit: Ultra Low-Bit Quantization via Latent Factorization

AlphaDecay: Module-wise Weight Decay for Heavy-Tailed Balancing in LLMs
