Efficient Compression Techniques for Large Language Models

Research on Large Language Models (LLMs) is increasingly focused on compression techniques that mitigate the memory footprint and computational cost of these models. Current approaches include attention-behavior-based KV cache compression, cross-layer parameter sharing, and low-rank weight decomposition, all aiming to shrink LLMs while preserving their performance. These techniques have shown promising results, yielding smaller, more efficient models that are practical to deploy in real-world applications. Noteworthy papers in this area include:

  • SurfaceLogicKV, which proposes a two-stage method that exploits attention behaviors for KV cache compression, improving compression robustness while maintaining competitive performance.
  • CommonKV, which introduces a training-free method for cross-layer KV cache compression via parameter sharing between adjacent layers, producing a latent KV cache that is easier to merge.
  • CALR, which combines a primary path of SVD-compressed layers with a parallel, learnable low-rank corrective module that recovers the residual functional error, enabling significantly smaller, more efficient LLMs (see the sketch after this list).
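
To make the low-rank-plus-correction idea concrete, below is a minimal PyTorch sketch, not the authors' implementation: the frozen primary path holds a truncated-SVD factorization of an original linear layer, while a parallel learnable low-rank pair (the names LowRankCorrectedLinear, corr_down, and corr_up are hypothetical) would be trained to recover the residual error between the compressed and original layer.

```python
import torch
import torch.nn as nn


class LowRankCorrectedLinear(nn.Module):
    """Sketch of a CALR-style layer (hypothetical name, not the paper's code):
    a frozen truncated-SVD factorization of the original weight, plus a small
    learnable low-rank corrective path intended to absorb the residual error."""

    def __init__(self, weight: torch.Tensor, rank: int, corrective_rank: int):
        super().__init__()
        out_features, in_features = weight.shape

        # Primary path: truncated SVD, W ≈ U_r (S_r V_r^T)
        U, S, Vh = torch.linalg.svd(weight.detach(), full_matrices=False)
        self.down = nn.Linear(in_features, rank, bias=False)
        self.up = nn.Linear(rank, out_features, bias=False)
        self.down.weight.data.copy_(S[:rank].unsqueeze(1) * Vh[:rank])
        self.up.weight.data.copy_(U[:, :rank])
        for p in (*self.down.parameters(), *self.up.parameters()):
            p.requires_grad_(False)  # keep the compressed primary path frozen

        # Parallel corrective path: learnable, initialized to contribute ~zero
        self.corr_down = nn.Linear(in_features, corrective_rank, bias=False)
        self.corr_up = nn.Linear(corrective_rank, out_features, bias=False)
        nn.init.normal_(self.corr_down.weight, std=1e-3)
        nn.init.zeros_(self.corr_up.weight)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Compressed output plus the learned low-rank correction
        return self.up(self.down(x)) + self.corr_up(self.corr_down(x))


# Example usage (illustrative sizes)
layer = nn.Linear(4096, 4096, bias=False)
compressed = LowRankCorrectedLinear(layer.weight.data, rank=256, corrective_rank=32)
y = compressed(torch.randn(2, 4096))
```

In such a setup, only the corrective factors would be trained (for example, on a small calibration set against the original layer's outputs), so the added parameter count stays small relative to the savings from truncating the SVD rank.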

Sources

SurfaceLogicKV: Surface and Logic Attention Behaviors are All You Need for Robust KV Cache Compression

CommonKV: Compressing KV Cache with Cross-layer Parameter Sharing

CALR: Corrective Adaptive Low-Rank Decomposition for Efficient Large Language Model Layer Compression

Learned Structure in CARTRIDGES: Keys as Shareable Routers in Self-Studied Representations
