Efficient Inference for Large Language Models

The field of large language models (LLMs) is moving toward more efficient inference methods that alleviate the KV cache memory bottleneck and reduce serving latency. Researchers are exploring techniques such as context-aware cache compression, overlapping encoding with prefill, dynamic sparse attention, and semantic-aware cache sharing. These innovations aim to improve serving performance while minimizing the computational resources required.

Noteworthy papers in this area include:

  • OjaKV, which uses Oja's rule for online subspace adaptation so that a low-rank KV cache basis stays current with the context, maintaining high-fidelity anchors for attention while achieving high compression ratios (see the first sketch after this list).
  • SparseServe, which proposes a hierarchical HBM-DRAM management system that unlocks the parallel potential of dynamic sparse attention algorithms in long-context serving (a toy two-tier block manager is sketched below).
  • SemShareKV, which accelerates LLM inference by reusing the KV cache across semantically similar prompts via fuzzy token-level LSH matching combined with rotary position embeddings (see the LSH sketch below).
  • StorInfer, which presents a storage-assisted LLM inference system that precomputes and stores predictable query-response pairs offline, so matching queries are answered from storage rather than recomputed, reducing latency and compute costs (see the lookup sketch below).
  • Expected Attention, which estimates the importance of KV pairs by predicting how future queries will attend to them, enabling principled ranking and pruning of the cache with minimal impact on the residual stream (see the Gaussian scoring sketch below).
  • ThinKV, which proposes a thought-adaptive KV cache compression framework that assigns token precision by thought importance and progressively evicts tokens from less critical thoughts as the reasoning trajectory evolves (see the final sketch below).
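
The core update behind OjaKV's "online subspace adaptation" is a classical algorithm, Oja's subspace rule, and is easy to illustrate. The NumPy sketch below tracks a low-rank basis over streaming key vectors and stores only their low-dimensional projections; it is a generic illustration assuming a plain Oja update, not OjaKV's actual method (which is context-aware and also maintains high-fidelity anchors, neither of which is modeled here).

```python
import numpy as np

def oja_subspace_update(W, x, lr=1e-3):
    """One step of Oja's subspace rule: nudge the d x r basis W toward the
    principal subspace of the streaming vectors x."""
    y = W.T @ x                                    # r-dim projection of x
    W = W + lr * (np.outer(x, y) - W @ np.outer(y, y))
    Q, _ = np.linalg.qr(W)                         # re-orthonormalize the basis
    return Q

# Toy run: track a rank-8 subspace of streaming 64-dim "key" vectors and keep
# only their 8-dim projections as the compressed cache entries.
rng = np.random.default_rng(0)
d, r = 64, 8
W = np.linalg.qr(rng.normal(size=(d, r)))[0]
compressed = []
for _ in range(1000):
    k = rng.normal(size=d)                         # stand-in for a new key vector
    W = oja_subspace_update(W, k)
    compressed.append(W.T @ k)                     # store r numbers instead of d

# Low-rank reconstruction of the most recent key; a real system must also handle
# older projections drifting out of sync as the basis W adapts.
k_hat = W @ compressed[-1]
```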
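
SparseServe's contribution is a serving system; the toy sketch below only illustrates the basic shape of hierarchical HBM-DRAM KV management: a small "HBM" tier holds the blocks selected by the current dynamic-sparse-attention step, and cold blocks spill to "DRAM". The class name and the LRU policy are assumptions for illustration, not the paper's design.

```python
from collections import OrderedDict

class TwoTierKVCache:
    """Hypothetical two-tier KV block manager: a small 'HBM' tier holds the
    blocks needed by the current sparse-attention step; the rest sit in 'DRAM'."""

    def __init__(self, hbm_capacity_blocks):
        self.capacity = hbm_capacity_blocks
        self.hbm = OrderedDict()    # block_id -> tensor, kept in LRU order
        self.dram = {}              # block_id -> tensor (spill tier)

    def put(self, block_id, tensor):
        self.dram[block_id] = tensor               # new blocks land in DRAM

    def fetch(self, block_ids):
        """Make the requested blocks resident in 'HBM', evicting LRU blocks."""
        for bid in block_ids:
            if bid in self.hbm:
                self.hbm.move_to_end(bid)          # refresh LRU position
                continue
            if len(self.hbm) >= self.capacity:     # evict the coldest block
                old_id, old_block = self.hbm.popitem(last=False)
                self.dram[old_id] = old_block
            self.hbm[bid] = self.dram.pop(bid)     # simulate DRAM -> HBM copy
        return [self.hbm[bid] for bid in block_ids]

cache = TwoTierKVCache(hbm_capacity_blocks=4)
for i in range(16):
    cache.put(i, f"kv-block-{i}")                  # stand-in for real tensors
hot = cache.fetch([2, 7, 11])                      # a sparse step touches few blocks
```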
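
A minimal way to picture token-level LSH matching is random-hyperplane hashing over token embeddings: tokens of a new prompt that hash to the same signature as tokens of a cached prompt become candidates for KV reuse. The sketch below assumes this generic LSH scheme and synthetic embeddings; SemShareKV's actual matching and its rotary-position-embedding handling will differ.

```python
import numpy as np

def lsh_signatures(embeddings, planes):
    """Random-hyperplane LSH: one sign bit per hyperplane, packed into an int."""
    bits = (embeddings @ planes.T) > 0                       # (n_tokens, n_planes)
    return (bits * (1 << np.arange(planes.shape[0]))).sum(axis=1)

rng = np.random.default_rng(0)
dim, n_planes = 128, 16
planes = rng.normal(size=(n_planes, dim))

cached_emb = rng.normal(size=(512, dim))                     # tokens of a cached prompt
new_emb = np.vstack([cached_emb[:400] + 0.05 * rng.normal(size=(400, dim)),
                     rng.normal(size=(100, dim))])           # similar prefix + new tail

cached_sig = lsh_signatures(cached_emb, planes)
new_sig = lsh_signatures(new_emb, planes)

# Map each cached signature to the position that produced it, then look up the
# new prompt's tokens: a hit marks a KV entry that could be reused (after the
# positional component, e.g. RoPE, is accounted for, which is omitted here).
sig_to_pos = {}
for pos, sig in enumerate(cached_sig):
    sig_to_pos.setdefault(int(sig), pos)
reuse_map = {i: sig_to_pos[int(s)] for i, s in enumerate(new_sig) if int(s) in sig_to_pos}
print(f"{len(reuse_map)} of {len(new_emb)} tokens matched a cached token")
```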
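
The precomputed-storage idea can be pictured as an embedding-similarity lookup in front of the model: answer from the store on a near-duplicate query, and fall back to live inference otherwise. Everything below (class name, toy embedding, threshold) is a hypothetical sketch, not StorInfer's implementation.

```python
import re
import numpy as np

class PrecomputedQueryStore:
    """Hypothetical query-response store: answer from precomputed pairs when a
    query is close enough in embedding space, otherwise call the live model."""

    def __init__(self, embed_fn, threshold=0.9):
        self.embed_fn = embed_fn
        self.threshold = threshold
        self.embeddings, self.responses = [], []

    def add(self, query, response):
        self.embeddings.append(self.embed_fn(query))
        self.responses.append(response)

    def answer(self, query, fallback_llm):
        q = self.embed_fn(query)
        if self.embeddings:
            sims = np.array(self.embeddings) @ q
            best = int(np.argmax(sims))
            if sims[best] >= self.threshold:
                return self.responses[best]        # store hit: no model call
        return fallback_llm(query)                 # store miss: run inference

def toy_embed(text):
    """Stand-in embedding; a real system would use a sentence encoder."""
    words = re.findall(r"[a-z]+", text.lower())
    v = np.array([hash(w) % 1000 for w in words][:8] + [0.0] * 8)[:8]
    return v / (np.linalg.norm(v) + 1e-9)

store = PrecomputedQueryStore(toy_embed)
store.add("what is the capital of france", "Paris.")
print(store.answer("What is the capital of France?", fallback_llm=lambda q: "<run the LLM>"))
```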
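
One way to estimate how future queries will attend to cached keys is to model queries as Gaussian and use the moment-generating function: if q ~ N(mu, Sigma), then E[exp(q.k / sqrt(d))] = exp(mu.k / sqrt(d) + k^T Sigma k / (2d)). The sketch below scores keys with that closed form, fitting mu and Sigma to recent queries as a proxy for the future-query distribution; this formulation is assumed to be consistent with the summary above and is not necessarily the paper's exact estimator.

```python
import numpy as np

def expected_attention_scores(keys, q_mean, q_cov, d_head):
    """Log of E[exp(q.k / sqrt(d))] for q ~ N(q_mean, q_cov):
    mu.k / sqrt(d) + k^T Sigma k / (2d), computed per cached key."""
    scale = 1.0 / np.sqrt(d_head)
    mean_term = keys @ q_mean * scale
    var_term = 0.5 * np.einsum("nd,de,ne->n", keys, q_cov, keys) * scale ** 2
    return mean_term + var_term          # monotone in the expected attention weight

rng = np.random.default_rng(0)
d_head, n_kv = 64, 1024
keys = rng.normal(size=(n_kv, d_head))
recent_queries = rng.normal(size=(128, d_head))   # proxy for the future-query distribution

q_mean = recent_queries.mean(axis=0)
q_cov = np.cov(recent_queries, rowvar=False)

scores = expected_attention_scores(keys, q_mean, q_cov, d_head)
keep = np.argsort(scores)[-256:]                  # retain the top 25% of KV pairs
pruned_keys = keys[keep]
```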
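
A thought-adaptive compression policy can be caricatured as: segment the cache by "thought", give important thoughts more precision, and evict tokens from the least important thoughts first when over budget. The sketch below does exactly that with uniform quantization; the segmentation, importance scores, bit-widths, and function names are all assumptions for illustration, not ThinKV's scheme.

```python
import numpy as np

def quantize(x, bits):
    """Uniform symmetric fake-quantization of an array to the given bit-width."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax + 1e-9
    return np.round(x / scale).clip(-qmax, qmax) * scale

def compress_thoughts(kv_by_thought, importance, budget_tokens):
    """Keep important thoughts at higher precision; evict tokens from the least
    important thoughts first until the cache fits the token budget."""
    order = np.argsort(importance)[::-1]                # most important thought first
    kept, used = {}, 0
    for t in order:
        if used >= budget_tokens:
            continue                                    # this thought is fully evicted
        kv = kv_by_thought[t][: budget_tokens - used]   # partial eviction if needed
        bits = 8 if importance[t] > 0.5 else 4          # precision follows importance
        kept[int(t)] = quantize(kv, bits)
        used += len(kv)
    return kept

rng = np.random.default_rng(0)
kv_by_thought = [rng.normal(size=(n, 64)) for n in (120, 300, 80)]   # three "thoughts"
importance = np.array([0.9, 0.3, 0.7])                  # e.g. attention mass per thought
compressed = compress_thoughts(kv_by_thought, importance, budget_tokens=256)
```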

Sources

OjaKV: Context-Aware Online Low-Rank KV Cache Compression with Oja's Rule

RServe: Overlapping Encoding and Prefill for Efficient LMM Inference

SparseServe: Unlocking Parallelism for Dynamic Sparse Attention in Long-Context LLM Serving

SemShareKV: Efficient KVCache Sharing for Semantically Similar Prompts via Token-Level LSH Matching

Accelerating LLM Inference with Precomputed Query Storage

The Pitfalls of KV Cache Compression

Expected Attention: KV Cache Compression by Estimating Attention from Future Queries Distribution

ThinKV: Thought-Adaptive KV Cache Compression for Efficient Reasoning Models
