Quantization Techniques for Efficient Large Language Models

The field of large language models (LLMs) is moving toward more efficient deployment through quantization. Researchers are exploring methods that reduce numerical precision while preserving accuracy, including pseudo-quantization training, quantization-aware training, and native low-precision training, all of which aim to address the substantial computational and memory demands of LLMs. Notably, innovations such as Gaussian weight sampling, outlier token tracing, and native FP4 training (Quartet) have shown promising results for scalable, efficient, and stable training. Studies on the effects of quantization on code-generating LLMs, together with new metrics for measuring trojan signals, have also provided valuable insight into the risks as well as the benefits of reduced precision. Overall, the field is advancing toward more efficient, more readily deployable LLMs.

Noteworthy papers include Accurate KV Cache Quantization with Outlier Tokens Tracing, which achieves significant accuracy improvements under 2-bit quantization, and Quartet: Native FP4 Training Can Be Optimal for Large Language Models, which enables accurate end-to-end FP4 training with all major computations performed in low precision.
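To make the common thread among these approaches concrete, the sketch below shows per-channel round-to-nearest "fake" quantization with a straight-through estimator, the basic building block underlying both quantization-aware and pseudo-quantization training. This is a minimal sketch assuming PyTorch, signed 4-bit symmetric quantization, and per-output-channel scales; it is not the specific algorithm of any paper listed here.

```python
# Minimal sketch of symmetric per-channel round-to-nearest (RTN) weight
# quantization with a straight-through estimator (STE). The 4-bit width and
# per-row granularity are illustrative assumptions, not the configuration
# used by any of the papers above.
import torch

def fake_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Quantize-dequantize w per output channel; gradients pass straight through."""
    qmax = 2 ** (bits - 1) - 1                        # e.g. 7 for signed 4-bit
    scale = w.abs().amax(dim=1, keepdim=True) / qmax  # one scale per output row
    scale = scale.clamp(min=1e-8)                     # guard against all-zero rows
    q = (w / scale).round().clamp(-qmax - 1, qmax)    # snap to the integer grid
    w_q = q * scale                                   # dequantize back to float
    # STE trick: forward pass sees w_q, backward pass sees the identity,
    # so gradients still update the full-precision master weights.
    return w + (w_q - w).detach()
```

In training schemes of this kind, the forward pass runs on quantized weights while the optimizer updates a full-precision copy; approaches such as Gaussian weight sampling replace the deterministic round with sampled noise to make this process more stable at scale.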

Sources

M|D|$\infty$ Queue Busy Period and Busy Cycle Distributions Computational Calculus

Accurate KV Cache Quantization with Outlier Tokens Tracing

Gaussian Weight Sampling for Scalable, Efficient and Stable Pseudo-Quantization Training

Capturing the Effects of Quantization on Trojans in Code LLMs

Scaling Law for Quantization-Aware Training

Quartet: Native FP4 Training Can Be Optimal for Large Language Models

Quaff: Quantized Parameter-Efficient Fine-Tuning under Outlier Spatial Stability Hypothesis

Modeling and Optimizing Latency for Delayed Hit Caching with Stochastic Miss Latency

Is (Selective) Round-To-Nearest Quantization All You Need?

NQKV: A KV Cache Quantization Scheme Based on Normal Distribution Characteristics
