The field of large language model compression is advancing rapidly, with a focus on methods that reduce model size and computational cost while preserving performance. Recently proposed techniques include generalized Fisher-weighted SVD, two-stage recoverable model pruning frameworks, and pruning metrics based on activation cosine similarity and variance. These methods target limitations of existing approaches, such as reliance on diagonal approximations of the Fisher information matrix and overly simplistic pruning criteria. Notably, some papers combine channel-level pruning with layer-level collapse diagnosis, reaching extreme compression ratios while maintaining strong performance.
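To ground the Fisher-weighted idea, the sketch below shows the diagonal-Fisher baseline that the generalized variants improve on: squared gradients from a calibration set approximate the diagonal of the Fisher information matrix, and the resulting per-row importance weights bias a truncated SVD toward loss-sensitive directions. This is a minimal illustration; the function name, the row-sum reduction, and the NumPy setting are assumptions, not the exact formulation of any cited paper.

```python
import numpy as np

def fisher_weighted_svd(W, grad_samples, rank):
    """Low-rank factorization of W biased by diagonal Fisher importance.

    W:            (out_dim, in_dim) weight matrix.
    grad_samples: iterable of (out_dim, in_dim) gradients of the loss
                  w.r.t. W, collected on calibration batches.
    rank:         target rank, rank < min(out_dim, in_dim).
    """
    # Diagonal Fisher approximation: mean squared gradient per weight.
    fisher = np.mean([g ** 2 for g in grad_samples], axis=0)

    # Collapse to one importance score per output row (an illustrative
    # reduction; the generalized methods above avoid exactly this kind
    # of diagonal/row collapse).
    imp = np.sqrt(fisher.sum(axis=1)) + 1e-8           # (out_dim,)

    # SVD of the importance-scaled matrix, then undo the scaling so the
    # factors approximate W itself rather than the scaled matrix.
    U, S, Vt = np.linalg.svd(W * imp[:, None], full_matrices=False)
    A = (U[:, :rank] / imp[:, None]) * S[:rank]        # (out_dim, rank)
    B = Vt[:rank]                                      # (rank, in_dim)
    return A, B                                        # W ≈ A @ B
```

Replacing a dense layer's weight with the pair (A, B) cuts its parameter count from out_dim·in_dim to rank·(out_dim + in_dim); the generalized Fisher-weighted approaches refine how the Fisher structure enters this factorization rather than the factorization itself.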
Noteworthy papers in this area include ACE, a pruning method that combines strong pruning performance with fast pruning speed and improved calibration efficiency, and DenoiseRotator, which enhances pruning robustness for LLMs by concentrating weight importance, consistently improving perplexity and zero-shot accuracy.
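As a concrete instance of the activation-based metrics mentioned above, the following sketch scores transformer blocks by the cosine similarity between their input and output activations on a calibration set: a block whose output barely moves the residual stream is a natural pruning candidate. The caching setup, score reduction, and selection rule are illustrative assumptions, not the exact metric of any specific paper.

```python
import numpy as np

def block_redundancy_scores(block_inputs, block_outputs):
    """Score each block by input/output activation cosine similarity.

    block_inputs, block_outputs: lists of (n_tokens, hidden) activation
    matrices, one pair per transformer block, cached on a small
    calibration set. Cosine similarity near 1 means the block changes
    the hidden states little, so it is more redundant.
    """
    scores = []
    for x, y in zip(block_inputs, block_outputs):
        cos = (x * y).sum(axis=-1) / (
            np.linalg.norm(x, axis=-1) * np.linalg.norm(y, axis=-1) + 1e-8
        )
        scores.append(float(cos.mean()))
    return scores

# Usage: remove the k blocks with the highest redundancy scores.
# order = np.argsort(block_redundancy_scores(xs, ys))[::-1]
```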