Efficient Key-Value Cache Management in Large Language Models

Research on large language model (LLM) inference is increasingly focused on efficient key-value (KV) cache management. Techniques such as runtime-adaptive pruning, prefix-aware attention, and query-agnostic cache compression reduce memory overhead and improve decoding speed, enabling longer context lengths and higher tokens-per-second throughput. Notable papers in this area include RAP, which proposes a reinforcement learning-based runtime-adaptive pruning framework, and FlashForge, which introduces a shared-prefix attention kernel for LLM decoding. Titanus and EFIM demonstrate, respectively, the effectiveness of software-hardware co-design and of a transformed prompt format with improved KV cache reuse for LLM serving. Other notable papers, such as Mustafar and KVzip, focus on promoting unstructured sparsity in KV cache pruning and on query-agnostic cache compression with context reconstruction. Together, these advances pave the way for more efficient and scalable LLM deployments. A concrete illustration of the underlying idea follows below.
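To make the general idea concrete, the following sketch (not taken from any of the papers above) shows one simple form of importance-based KV cache pruning: cached keys and values that received little accumulated attention are evicted under a retention budget. The single-head setup, the tensor shapes, the `keep_ratio` knob, and the scoring heuristic are illustrative assumptions, not the method of RAP, KVzip, or any other cited work.

```python
# Minimal sketch of importance-based KV cache pruning (illustrative only).
# Assumptions: a single attention head, a cache of shape [seq_len, head_dim],
# and per-token importance approximated by accumulated attention weights.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def prune_kv_cache(keys, values, attn_history, keep_ratio=0.5):
    """Keep only the most-attended cache entries.

    keys, values:  [seq_len, head_dim] cached projections
    attn_history:  [num_queries, seq_len] past attention weights
    keep_ratio:    fraction of cache entries to retain (hypothetical knob)
    """
    # Importance score: how much past queries attended to each cached token.
    importance = attn_history.sum(axis=0)                # [seq_len]
    keep = max(1, int(keep_ratio * keys.shape[0]))
    kept_idx = np.sort(np.argsort(importance)[-keep:])   # preserve token order
    return keys[kept_idx], values[kept_idx], kept_idx

# Toy usage: build a random cache, attend with a few queries, then prune.
rng = np.random.default_rng(0)
seq_len, head_dim = 16, 8
K = rng.standard_normal((seq_len, head_dim))
V = rng.standard_normal((seq_len, head_dim))
Q = rng.standard_normal((4, head_dim))
attn = softmax(Q @ K.T / np.sqrt(head_dim))              # [4, seq_len]
K_small, V_small, idx = prune_kv_cache(K, V, attn, keep_ratio=0.25)
print("kept", len(idx), "of", seq_len, "cache entries:", idx)
```

Real systems like those surveyed here go further: they make such decisions adaptively at runtime, per layer and per head, and combine eviction with quantization, unstructured sparsity, or prefix sharing rather than relying on a fixed keep ratio.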

Sources

RAP: Runtime-Adaptive Pruning for LLM Inference

FlashForge: Ultra-Efficient Prefix-Aware Attention for LLM Decoding

Titanus: Enabling KV Cache Pruning and Quantization On-the-Fly for LLM Acceleration

EFIM: Efficient Serving of LLMs for Infilling Tasks with Improved KV Cache Reuse

Towards Efficient Key-Value Cache Management for Prefix Prefilling in LLM Inference

Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding

Mustafar: Promoting Unstructured Sparsity for KV Cache Pruning in LLM Inference

DINGO: Constrained Inference for Diffusion LLMs

KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction
