Efficient Quantization Techniques for Large Language Models

The field of large language models is moving toward more efficient quantization techniques that reduce computational cost and memory footprint. Recent work has concentrated on post-training quantization, including wavelet-enhanced high-fidelity 1-bit quantization and adaptive transforms for joint weight-activation quantization, both of which improve quantization fidelity and limit performance degradation at very low bit widths. Researchers have also explored intrinsic structure as a proxy for saliency in mixed-precision quantization, low-rank "prehab" that prepares networks for SVD compression, and phase-aware quantization schemes for complex-valued models. Noteworthy papers include HBLLM, which introduces a wavelet-enhanced high-fidelity 1-bit post-training quantization method, and FAIRY2I, a universal framework that transforms pre-trained real-valued layers into an equivalent widely-linear complex form for extremely low-bit quantization. Together, these advances point toward practical deployment of large language models on commodity hardware.
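
To ground two of the recurring ideas above, the sketch below shows textbook baselines only: sign-scale 1-bit weight quantization with per-group scales, and truncated-SVD low-rank compression of a weight matrix. It is not an implementation of HBLLM, Low-Rank Prehab, or any other listed method, and the function names (`binarize_weights`, `svd_truncate`) and the `group_size`/`rank` parameters are illustrative choices, not taken from the papers.

```python
# Minimal illustrative sketch (assumptions noted above), not the papers' methods.
import numpy as np


def binarize_weights(w: np.ndarray, group_size: int = 128):
    """Quantize a 2-D weight matrix to {-1, +1} with one scale per group of columns.

    The scale is the mean absolute value within each group, a common closed-form
    choice for sign-scale binarization.
    """
    rows, cols = w.shape
    assert cols % group_size == 0, "columns must divide evenly into groups"
    w_groups = w.reshape(rows, cols // group_size, group_size)
    scales = np.abs(w_groups).mean(axis=-1, keepdims=True)  # per-group scale alpha
    signs = np.sign(w_groups)
    signs[signs == 0] = 1.0                                  # avoid a zero code
    w_hat = (signs * scales).reshape(rows, cols)             # dequantized approximation
    return signs.reshape(rows, cols), scales.squeeze(-1), w_hat


def svd_truncate(w: np.ndarray, rank: int) -> np.ndarray:
    """Replace W with its best rank-r approximation U_r diag(s_r) V_r^T."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank, :]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(256, 512)).astype(np.float32)
    _, _, w_bin = binarize_weights(w, group_size=128)
    w_lr = svd_truncate(w, rank=64)
    print("1-bit relative error:", np.linalg.norm(w - w_bin) / np.linalg.norm(w))
    print("rank-64 relative error:", np.linalg.norm(w - w_lr) / np.linalg.norm(w))
```

The listed papers refine these baselines in different directions, for example by improving the fidelity of the 1-bit reconstruction or by choosing ranks and precisions per layer rather than uniformly.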

Sources

HBLLM: Wavelet-Enhanced High-Fidelity 1-Bit Quantization for LLMs

WUSH: Near-Optimal Adaptive Transforms for LLM Quantization

Intrinsic Structure as a Proxy for Saliency: SVD-Based Weight Preservation for Mixed-Precision Quantization in Large Language Models

Low-Rank Prehab: Preparing Neural Networks for SVD Compression

Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling

FAIRY2I: Universal Extremely-Low Bit QAT framework via Widely-Linear Representation and Phase-Aware Quantization

Globally optimized SVD compression of LLMs via Fermi-function-based rank selection and gauge fixing

ConvRot: Rotation-Based Plug-and-Play 4-bit Quantization for Diffusion Transformers

Convergence for Discrete Parameter Updates

BEP: A Binary Error Propagation Algorithm for Binary Neural Networks Training

SignRoundV2: Closing the Performance Gap in Extremely Low-Bit Post-Training Quantization for LLMs
