Advances in Large Language Model Compression

Research on large language model compression is advancing rapidly, aiming to reduce model size and computational cost while preserving performance. Recent work proposes techniques such as generalized Fisher-weighted SVD, two-stage recoverable model pruning frameworks, and pruning metrics based on activation cosine similarity and variance. These methods address limitations of earlier approaches, notably diagonal approximations of the Fisher information matrix and overly simplistic pruning criteria. Some frameworks further combine channel-level pruning with layer-level collapse diagnosis, reaching extreme compression ratios while retaining strong performance.
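To make the Fisher-weighted SVD idea concrete, the sketch below compresses a single weight matrix by scaling its rows with a Fisher information estimate before a low-rank factorization. This is a minimal illustration of the general Fisher-weighted SVD recipe under stated assumptions, not the Kronecker-factored generalization of the cited paper; the `fisher` array (standing in for averaged squared gradients), the row-wise aggregation, and the rank choice are all assumptions made for illustration.

```python
import numpy as np

def fisher_weighted_svd(W, fisher, rank):
    """Low-rank compression of W (out_dim x in_dim), weighting the SVD by a
    row-wise Fisher importance estimate.

    `fisher` is a per-element Fisher information estimate (e.g. averaged
    squared gradients) with the same shape as W. (Illustrative sketch.)
    """
    # Aggregate Fisher information per output row; rows the loss is most
    # sensitive to get reconstructed more faithfully.
    row_importance = np.sqrt(fisher.sum(axis=1)) + 1e-8      # shape: (out_dim,)

    # SVD of the importance-scaled matrix diag(row_importance) @ W.
    scaled = W * row_importance[:, None]
    U, S, Vt = np.linalg.svd(scaled, full_matrices=False)

    # Keep the top-`rank` components and undo the row scaling on the U factor.
    A = (U[:, :rank] * S[:rank]) / row_importance[:, None]   # (out_dim, rank)
    B = Vt[:rank]                                            # (rank, in_dim)
    return A, B                                              # W is approximated by A @ B

# Toy usage: random weights with synthetic Fisher estimates.
rng = np.random.default_rng(0)
W = rng.standard_normal((256, 512))
fisher = rng.random((256, 512))      # stand-in for averaged squared gradients
A, B = fisher_weighted_svd(W, fisher, rank=64)
print(np.linalg.norm(W - A @ B) / np.linalg.norm(W))
```

The row scaling biases the factorization toward the directions that matter most for the loss, which is the motivation for moving beyond a plain, unweighted SVD.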

Noteworthy papers in this area include ACE, which prunes LLMs using activation cosine similarity and variance, achieving strong accuracy with fast, calibration-efficient pruning, and DenoiseRotator, which improves pruning robustness by concentrating weight importance, yielding consistent gains in perplexity and zero-shot accuracy.
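As a rough illustration of the activation-based family of pruning metrics that ACE explores, the snippet below scores transformer blocks on calibration activations by the cosine similarity between their input and output hidden states, combined with output variance: a block whose output barely deviates from its input is a pruning candidate. This is a hedged sketch of that class of metrics, not the paper's exact criterion; the scoring formula, tensor shapes, and toy data are assumptions.

```python
import torch

def block_redundancy_scores(hidden_in, hidden_out):
    """Score transformer blocks by how little they change their inputs.

    hidden_in, hidden_out: lists of tensors of shape (tokens, hidden_dim),
    the hidden states entering and leaving each block on calibration data.
    (Illustrative scoring rule, not the ACE criterion.)
    """
    scores = []
    for x, y in zip(hidden_in, hidden_out):
        cos = torch.nn.functional.cosine_similarity(x, y, dim=-1).mean()
        var = y.var(dim=0).mean()
        # Higher score = more redundant: the output is nearly parallel to
        # the input and carries little variance of its own.
        scores.append((cos / (1.0 + var)).item())
    return scores

# Toy usage with random calibration activations for a 4-block model.
torch.manual_seed(0)
hidden_in = [torch.randn(128, 64) for _ in range(4)]
hidden_out = [h + 0.05 * torch.randn_like(h) for h in hidden_in]
print(block_redundancy_scores(hidden_in, hidden_out))
```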

Sources

Generalized Fisher-Weighted SVD: Scalable Kronecker-Factored Fisher Approximation for Compressing Large Language Models

FCOS: A Two-Stage Recoverable Model Pruning Framework for Automatic Modulation Recognition

ACE: Exploring Activation Cosine Similarity and Variance for Accurate and Calibration-Efficient LLM Pruning

Balanced Token Pruning: Accelerating Vision Language Models Beyond Local Optimization

SlimLLM: Accurate Structured Pruning for Large Language Models

DenoiseRotator: Enhance Pruning Robustness for LLMs via Importance Concentration
