Efficient Key-Value Cache Management for Large Language Models

The field of large language models (LLMs) is moving toward more efficient key-value (KV) cache management to reduce redundant computation and improve inference performance. Recent work centers on lossy compression techniques, graph-based eviction strategies, and adaptive cache compression methods, all aimed at minimizing loading delays, shrinking memory footprints, and preserving generation quality. Notable papers in this area include AdaptCache, which achieves significant delay savings and quality improvements through lossy KV cache compression, and GraphKV, which introduces a graph-based framework for token selection and adaptive retention. Other noteworthy papers are KVComp, which presents a high-performance, LLM-aware lossy compression framework, and EvolKV, which proposes an evolutionary framework for layer-wise, task-driven KV cache compression.
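To make the general idea concrete, the sketch below shows a minimal, generic form of score-based KV cache eviction: cached tokens are ranked by a cumulative-attention importance proxy and only a fixed budget of them is retained. This is an illustrative assumption, not the actual algorithm of GraphKV, PagedEviction, or any other paper listed here; the function name evict_kv_cache, the attn_scores proxy, and the toy shapes are all hypothetical.

```python
import numpy as np

def evict_kv_cache(keys, values, attn_scores, budget):
    """Generic score-based KV cache eviction (illustrative sketch only).

    keys, values: (seq_len, head_dim) cached tensors for one attention head.
    attn_scores:  (seq_len,) cumulative attention each cached token has
                  received from recent queries -- a common, assumed proxy
                  for token importance, not any specific paper's metric.
    budget:       number of tokens to retain after eviction.
    """
    if keys.shape[0] <= budget:
        return keys, values
    # Keep the `budget` tokens with the highest cumulative attention.
    keep = np.argsort(attn_scores)[-budget:]
    keep.sort()  # preserve the original token order in the cache
    return keys[keep], values[keep]


# Toy usage: a 16-token cache reduced to an 8-token budget.
rng = np.random.default_rng(0)
K = rng.standard_normal((16, 64)).astype(np.float32)
V = rng.standard_normal((16, 64)).astype(np.float32)
scores = rng.random(16).astype(np.float32)
K_small, V_small = evict_kv_cache(K, V, scores, budget=8)
print(K_small.shape, V_small.shape)  # (8, 64) (8, 64)
```

The papers above differ mainly in how the importance signal is computed (graph-based token relationships, layer-wise evolutionary budgets, block-wise structure) and in whether retained entries are additionally compressed lossily rather than kept at full precision.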

Sources

AdaptCache: KV Cache Native Storage Hierarchy for Low-Delay and High-Quality Language Model Serving

GraphKV: Breaking the Static Selection Paradigm with Graph-Based KV Cache Eviction

KVComp: A High-Performance, LLM-Aware, Lossy Compression Framework for KV Cache

STZ: A High Quality and High Speed Streaming Lossy Compression Framework for Scientific Data

Adaptive KV-Cache Compression without Manually Setting Budget

Amplifying Effective CXL Memory Bandwidth for LLM Inference via Transparent Near-Data Processing

PagedEviction: Structured Block-wise KV Cache Pruning for Efficient Large Language Model Inference

KVCompose: Efficient Structured KV Cache Compression with Composite Tokens

Waltz: Temperature-Aware Cooperative Compression for High-Performance Compression-Based CSDs

EvolKV: Evolutionary KV Cache Compression for LLM Inference
