The field of large language models (LLMs) is moving towards more efficient compression techniques that reduce computational demands and make deployment more practical. Recent advances focus on dynamic and adaptive pruning methods that prioritize critical model components and tokens, preserving performance while reducing parameter count. Notably, new approaches integrate pruning with fine-tuning and develop layer pruning strategies that address mismatches in activation magnitudes across layers and tokens.
Some noteworthy papers include:

- DLP proposes a dynamic layerwise pruning approach that adaptively determines layer importance and achieves state-of-the-art results at high sparsity levels (the first sketch below illustrates the general idea).
- SEFT introduces a sparse fine-tuning method that dynamically evolves the sparse topology of pruned models during fine-tuning, offering superior memory and time efficiency.
- LinearPatch presents a simple yet effective technique for reviving layer-pruned LLMs by suppressing outliers and aligning activation magnitudes (see the second sketch below).
- Hopscotch identifies and skips redundant attention blocks in language models, preserving output quality while reducing computational cost (see the third sketch below).
- SkipGPT introduces a dynamic layer pruning framework that prioritizes critical tokens and decouples pruning policies for MLP and self-attention components, achieving significant parameter reduction while matching or exceeding the original model's performance.
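
To make the layerwise idea concrete, here is a minimal sketch of importance-aware, non-uniform sparsity allocation in PyTorch. The importance proxy (mean absolute weight magnitude), the inverse-importance allocation rule, and the `allocate_sparsity`/`prune_layer` helpers are illustrative assumptions; they do not reproduce DLP's actual criterion.

```python
# Minimal sketch of non-uniform, importance-aware layerwise pruning.
# The importance measure and the sparsity-allocation rule below are
# illustrative assumptions, not the method used by DLP.
import torch
import torch.nn as nn


def layer_importance(layer: nn.Linear) -> float:
    # Proxy for layer importance: mean absolute weight value.
    return layer.weight.abs().mean().item()


def allocate_sparsity(importances, target_sparsity):
    # Less important layers receive more pruning; the allocation is rescaled
    # so the average sparsity across layers matches the global target.
    imp = torch.tensor(importances)
    inv = 1.0 / (imp + 1e-8)
    raw = inv / inv.sum() * target_sparsity * len(importances)
    return raw.clamp(max=0.95).tolist()


def prune_layer(layer: nn.Linear, sparsity: float) -> None:
    # Unstructured magnitude pruning: zero out the smallest-magnitude weights.
    w = layer.weight.data
    k = int(w.numel() * sparsity)
    if k == 0:
        return
    threshold = w.abs().flatten().kthvalue(k).values
    layer.weight.data = torch.where(w.abs() > threshold, w, torch.zeros_like(w))


# Toy usage on a stack of linear layers standing in for transformer blocks.
layers = [nn.Linear(256, 256) for _ in range(8)]
sparsities = allocate_sparsity([layer_importance(l) for l in layers], target_sparsity=0.6)
for layer, s in zip(layers, sparsities):
    prune_layer(layer, s)
```

The key design choice this illustrates is that the pruning budget is distributed unevenly: layers judged more important keep more of their weights, while the global sparsity target is still met on average.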
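In a similar spirit to LinearPatch, the next sketch estimates a per-channel scale from calibration activations and packages it as a plain linear map that could be fused into an adjacent weight. The specific statistic (mean absolute activation per channel) is an assumption for illustration, not the paper's exact procedure.

```python
# Minimal sketch of aligning activation magnitudes after dropping layers.
# The per-channel scaling estimated from calibration activations is an
# illustrative assumption, not LinearPatch's exact recipe.
import torch

@torch.no_grad()
def magnitude_patch(pre_acts: torch.Tensor, post_acts: torch.Tensor) -> torch.Tensor:
    # Per-channel scale mapping the activations entering the pruned span
    # to the magnitude of the activations that used to leave it.
    scale = post_acts.abs().mean(dim=0) / (pre_acts.abs().mean(dim=0) + 1e-8)
    return torch.diag(scale)   # a plain linear "patch" that can be fused downstream

# Toy usage with random calibration activations of hidden size 64.
pre = torch.randn(1024, 64)          # activations before the removed layers
post = 1.7 * torch.randn(1024, 64)   # activations after them (larger magnitude)
patch = magnitude_patch(pre, post)
aligned = pre @ patch                # rescaled to the post-layer magnitude
```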
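Finally, a minimal sketch of skipping an attention sub-block behind a learned gate, assuming a single scalar gate per sequence and a hard threshold. This is a generic stand-in for the idea of bypassing redundant attention, not the routing policy used by Hopscotch or SkipGPT.

```python
# Minimal sketch of gated skipping of an attention sub-block. The scalar
# gate and hard threshold are illustrative assumptions, not the routing
# policies of Hopscotch or SkipGPT.
import torch
import torch.nn as nn


class SkippableAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int, threshold: float = 0.5):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Linear(dim, 1)   # predicts whether attention is worth computing
        self.threshold = threshold

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Gate on the mean token representation: one skip decision per sequence.
        score = torch.sigmoid(self.gate(x.mean(dim=1)))   # (batch, 1)
        keep = score > self.threshold                      # (batch, 1)
        if not keep.any():
            return x    # whole batch skips: pure residual pass-through
        attn_out, _ = self.attn(x, x, x)
        # Sequences below the threshold keep the residual stream unchanged.
        return torch.where(keep.unsqueeze(-1), x + attn_out, x)


# Toy usage: one block over a batch of 4 sequences of length 16.
block = SkippableAttention(dim=64, num_heads=4)
out = block(torch.randn(4, 16, 64))
```

Because skipped sequences simply pass the residual stream through, the compute saved scales with how often the gate fires, which is the basic trade-off these block-skipping methods tune.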