Efficient Optimization and Quantization in Large Language Models

The field of large language models (LLMs) is moving toward more efficient optimization and quantization techniques that reduce memory consumption and improve performance. On the optimization side, researchers are applying information geometry and quantum metrics to better understand the training landscape. On the quantization side, the focus is on methods that reduce rounding and clipping errors, and on blockwise optimization schemes that avoid memory-intensive full-model backpropagation. Noteworthy papers include BASE-Q, which combines bias correction and asymmetric scaling to reduce rounding and clipping errors; PAROAttention, which reorders attention patterns to ease sparsification and quantization; AnTKV, which leverages anchor-token-aware vector quantization to compress the KV cache; and Outlier-Safe Pre-Training, which offers practical guidelines for proactively preventing outlier formation during LLM training.
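
To make the bias-correction-plus-asymmetric-scaling idea concrete, here is a minimal, self-contained sketch of generic per-channel asymmetric weight quantization with a simple bias-correction step computed from calibration activations. This is an illustration of the general technique only, not the BASE-Q algorithm; all function names, shapes, and the toy data are assumptions introduced for this example.

```python
# Minimal sketch (illustrative only, not the BASE-Q method): asymmetric
# per-channel weight quantization plus a simple bias-correction step.
import numpy as np

def asymmetric_quantize(w: np.ndarray, n_bits: int = 4) -> np.ndarray:
    """Per-output-channel asymmetric (scale + zero-point) quantization,
    returned in dequantized form so the error can be inspected directly."""
    qmax = 2 ** n_bits - 1
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = np.maximum(w_max - w_min, 1e-8) / qmax   # per-channel step size
    zero_point = np.round(-w_min / scale)            # asymmetric offset
    q = np.clip(np.round(w / scale) + zero_point, 0, qmax)
    return (q - zero_point) * scale

def bias_correct(w, w_q, x_calib, bias):
    """Fold the expected output shift caused by quantization error
    (estimated over calibration inputs) back into the layer bias."""
    err = (w - w_q) @ x_calib.mean(axis=0)           # mean rounding-induced shift
    return bias + err

# Toy demonstration with hypothetical shapes and random calibration data.
rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16))            # weights: (out_features, in_features)
b = np.zeros(8)
x = rng.normal(size=(128, 16))          # calibration activations
w_q = asymmetric_quantize(w, n_bits=4)
b_corr = bias_correct(w, w_q, x, b)
print(np.abs((x @ w.T + b) - (x @ w_q.T + b_corr)).mean())
```

The design point this sketch illustrates is that an asymmetric scale-and-zero-point grid tracks skewed weight distributions better than a symmetric one, and that absorbing the average quantization error into the bias is a cheap way to remove systematic output shift; the actual BASE-Q procedure should be taken from the paper itself.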

Sources

BASE-Q: Bias and Asymmetric Scaling Enhanced Rotational Quantization for Large Language Models

Rethinking LLM Training through Information Geometry and Quantum Metrics

PAROAttention: Pattern-Aware ReOrdering for Efficient Sparse and Quantized Attention in Visual Generation Models

A Minimalist Optimizer Design for LLM Pretraining

AnTKV: Anchor Token-Aware Sub-Bit Vector Quantization for KV Cache in Large Language Models

Outlier-Safe Pre-Training for Robust 4-Bit Quantization of Large Language Models

Radial Attention: $O(n\log n)$ Sparse Attention with Energy Decay for Long Video Generation

Q-resafe: Assessing Safety Risks and Quantization-aware Safety Patching for Quantized Large Language Models

Characterization and Mitigation of Training Instabilities in Microscaling Formats
