Efficient Pruning Techniques for Large Language Models

The field of large language models is moving toward more efficient pruning techniques that reduce computational and storage requirements. Recent work has focused on improving the accuracy of N:M sparse models, mitigating pruning-induced error, and optimizing channel permutations. Novel pruning frameworks and criteria have been proposed to address the limitations of traditional methods such as handcrafted quality metrics and greedy heuristics, and these advances have led to notable gains in model performance and compression ratios.

Noteworthy papers include:

PermLLM introduces learnable channel permutation for N:M sparsity and achieves superior performance when optimizing N:M sparse models.

Don't Be Greedy, Just Relax! Pruning LLMs via Frank-Wolfe drastically reduces the per-layer pruning error and outperforms strong baselines on state-of-the-art GPT architectures.

Entropy Meets Importance introduces a unified head importance-entropy score for stable and efficient transformer pruning, yielding up to a 15.2% improvement in model quality and a 2.04x improvement in stability.

A Free Lunch in LLM Compression challenges common intuitions about retraining after pruning, showing that reconstructing attention and MLP components separately can outperform full retraining.
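To make the N:M sparsity and permutation ideas concrete, the following is a minimal, illustrative Python sketch, not PermLLM's actual algorithm: it applies magnitude-based 2:4 pruning to a weight matrix and uses a simple random search over input-channel permutations to reduce the pruning-induced error, standing in for the learnable permutation described above. The function names and the random-search strategy are assumptions made for illustration.

```python
# Illustrative sketch only: magnitude-based 2:4 (N:M) pruning plus a random
# search over column permutations that lowers the pruning error. This is not
# the method of any of the papers above; it only demonstrates the general idea.
import numpy as np

def prune_n_m(weights: np.ndarray, n: int = 2, m: int = 4) -> np.ndarray:
    """Keep the n largest-magnitude weights in every group of m along each row."""
    rows, cols = weights.shape
    assert cols % m == 0, "column count must be divisible by the group size m"
    groups = weights.reshape(rows, cols // m, m)
    # Zero out the (m - n) smallest-magnitude entries in each group.
    drop_idx = np.argsort(np.abs(groups), axis=-1)[..., : m - n]
    mask = np.ones_like(groups)
    np.put_along_axis(mask, drop_idx, 0.0, axis=-1)
    return (groups * mask).reshape(rows, cols)

def pruning_error(weights: np.ndarray, pruned: np.ndarray) -> float:
    """Frobenius-norm error introduced by pruning."""
    return float(np.linalg.norm(weights - pruned))

def search_channel_permutation(weights: np.ndarray, trials: int = 200,
                               seed: int = 0) -> np.ndarray:
    """Random search over input-channel (column) permutations that reduce the
    2:4 pruning error; a stand-in for a learned channel permutation."""
    rng = np.random.default_rng(seed)
    best_perm = np.arange(weights.shape[1])
    best_err = pruning_error(weights, prune_n_m(weights))
    for _ in range(trials):
        perm = rng.permutation(weights.shape[1])
        err = pruning_error(weights[:, perm], prune_n_m(weights[:, perm]))
        if err < best_err:
            best_err, best_perm = err, perm
    return best_perm

if __name__ == "__main__":
    W = np.random.default_rng(1).normal(size=(64, 128))
    baseline = pruning_error(W, prune_n_m(W))
    perm = search_channel_permutation(W)
    permuted = pruning_error(W[:, perm], prune_n_m(W[:, perm]))
    print(f"2:4 pruning error without permutation: {baseline:.3f}")
    print(f"2:4 pruning error with searched permutation: {permuted:.3f}")
```

Because a column permutation only changes how weights are grouped into blocks of four, different permutations select different weights to drop; the learned permutations in the work above pursue the same objective far more effectively than the random search used here.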

Sources

PermLLM: Learnable Channel Permutation for N:M Sparse Large Language Models

Don't Be Greedy, Just Relax! Pruning LLMs via Frank-Wolfe

Entropy Meets Importance: A Unified Head Importance-Entropy Score for Stable and Efficient Transformer Pruning

A Free Lunch in LLM Compression: Revisiting Retraining after Pruning
