The field of large language models (LLMs) is moving towards more efficient inference methods that alleviate the memory bottleneck created by the key-value (KV) cache and reduce serving latency. Researchers are exploring techniques such as context-aware cache compression, overlapping encoding with prefill, dynamic sparse attention, and semantic-aware cache sharing across requests. These innovations aim to preserve generation quality while cutting the memory and compute that inference requires.
Noteworthy papers in this area include:
- OjaKV, which introduces a framework for online subspace adaptation that maintains high-fidelity anchors for attention while achieving high KV-cache compression ratios (a generic subspace-tracking sketch follows this list).
- SparseServe, which proposes a hierarchical HBM-DRAM management system to unlock the parallel potential of dynamic sparse attention algorithms.
- SemShareKV, which accelerates LLM inference by reusing the KV cache across semantically similar prompts via fuzzy token matching and rotary position embeddings (RoPE); a prompt-matching sketch follows this list.
- StorInfer, which presents a storage-assisted LLM inference system that precomputes and stores predictable query-response pairs offline to reduce latency and compute costs.
- Expected Attention, which estimates the importance of KV pairs by predicting how future queries will attend to them, enabling principled ranking and pruning of the cache with minimal impact on the residual stream (a scoring-and-pruning sketch follows this list).
- ThinKV, which proposes a thought-adaptive KV cache compression framework that assigns token precision according to thought importance and progressively evicts tokens from less critical thoughts as the reasoning trajectory evolves (a precision-assignment sketch follows this list).
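
To make these ideas concrete, here is a minimal sketch of the kind of online subspace tracking that OjaKV's name alludes to: an Oja-style update that maintains a low-rank orthonormal basis as key vectors stream in, so each vector can be cached as rank-r coefficients. The class name, rank, and learning rate are illustrative assumptions; this is not the paper's implementation.

```python
import numpy as np

class OnlineSubspace:
    """Track a rank-r subspace of streaming vectors with an Oja-style update.

    Illustrative sketch of online subspace adaptation; not the OjaKV code.
    """

    def __init__(self, dim: int, rank: int, lr: float = 0.01, seed: int = 0):
        rng = np.random.default_rng(seed)
        # Start from a random orthonormal basis of shape (dim, rank).
        self.basis, _ = np.linalg.qr(rng.standard_normal((dim, rank)))
        self.lr = lr

    def update(self, x: np.ndarray) -> None:
        """One Oja subspace step for a single vector x of shape (dim,)."""
        y = self.basis.T @ x                               # coefficients in the subspace
        self.basis += self.lr * np.outer(x - self.basis @ y, y)
        # Re-orthonormalize so the basis stays well conditioned.
        self.basis, _ = np.linalg.qr(self.basis)

    def compress(self, x: np.ndarray) -> np.ndarray:
        """Rank-r coefficients of x; this is what would be cached."""
        return self.basis.T @ x

    def reconstruct(self, coeffs: np.ndarray) -> np.ndarray:
        return self.basis @ coeffs


if __name__ == "__main__":
    # Stream synthetic keys that live in a 16-dimensional subspace of R^128.
    dim, rank = 128, 16
    rng = np.random.default_rng(1)
    latent = rng.standard_normal((rank, dim)) / np.sqrt(dim)
    tracker = OnlineSubspace(dim, rank)
    for _ in range(2000):
        tracker.update(rng.standard_normal(rank) @ latent)
    key = rng.standard_normal(rank) @ latent
    err = np.linalg.norm(key - tracker.reconstruct(tracker.compress(key)))
    print(f"reconstruction error: {err:.4f} (key norm {np.linalg.norm(key):.4f})")
```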
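
Fuzzy prompt matching for cache reuse can be sketched generically: compare the incoming token sequence against cached prompts, and if one is similar enough, reuse the KV entries of the exactly-shared prefix. The difflib-based matcher, the 0.8 threshold, and the cache layout are assumptions for illustration, not SemShareKV's algorithm; the RoPE handling the paper mentions is omitted here.

```python
import difflib
import numpy as np

def best_reusable_prompt(new_tokens, cache, min_similarity=0.8):
    """Pick the cached prompt whose token sequence best matches the new prompt.

    cache: dict mapping a token-id tuple to its stored KV array.
    Returns (matched_tokens, shared_prefix_len, kv), or None if nothing is
    similar enough. Hedged sketch: difflib stands in for a real fuzzy matcher.
    """
    best, best_score = None, 0.0
    for cached_tokens, kv in cache.items():
        score = difflib.SequenceMatcher(None, new_tokens, cached_tokens).ratio()
        if score > best_score:
            best, best_score = (cached_tokens, kv), score
    if best is None or best_score < min_similarity:
        return None
    cached_tokens, kv = best
    # Length of the exactly-shared prefix whose KV entries can be reused verbatim;
    # positions beyond it would need recomputation or positional re-alignment.
    prefix = 0
    for a, b in zip(new_tokens, cached_tokens):
        if a != b:
            break
        prefix += 1
    return cached_tokens, prefix, kv


if __name__ == "__main__":
    # Usage: reuse the KV prefix of a near-duplicate prompt.
    cache = {(1, 2, 3, 4, 5, 6): np.zeros((6, 64))}
    hit = best_reusable_prompt((1, 2, 3, 4, 5, 9), cache)
    if hit is not None:
        _, prefix_len, _ = hit
        print(f"reuse first {prefix_len} KV entries")
```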
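
The general recipe of ranking cached KV pairs by the attention mass future queries are expected to place on them can also be sketched in a few lines: model upcoming queries with a simple Gaussian fitted to recent ones, compute each key's expected (log) attention weight, and drop the lowest-scoring entries. The Gaussian query model, function names, and keep ratio are assumptions, not the paper's exact estimator.

```python
import numpy as np

def expected_attention_scores(keys: np.ndarray, query_samples: np.ndarray) -> np.ndarray:
    """Score each cached key by the attention an 'average' future query gives it.

    keys:          (n_keys, d) cached key vectors
    query_samples: (n_q, d) recent queries used to estimate the query distribution
    Returns one score per key (higher = more important). Sketch only: queries are
    modeled as Gaussian and scored via log E[exp(q.k / sqrt(d))].
    """
    d = keys.shape[1]
    mu = query_samples.mean(axis=0)                      # query mean
    sigma = np.cov(query_samples, rowvar=False)          # query covariance (d, d)
    mean_logit = keys @ mu / np.sqrt(d)
    # Variance of q.k / sqrt(d) under the Gaussian query model.
    var_logit = np.einsum("nd,de,ne->n", keys, sigma, keys) / d
    # Log of the expected exponentiated logit (log-normal moment).
    return mean_logit + 0.5 * var_logit

def prune_kv_cache(keys, values, query_samples, keep_ratio=0.5):
    """Keep the top `keep_ratio` fraction of KV pairs by expected-attention score."""
    scores = expected_attention_scores(keys, query_samples)
    n_keep = max(1, int(len(keys) * keep_ratio))
    kept = np.argsort(scores)[-n_keep:]
    kept.sort()                                          # preserve positional order
    return keys[kept], values[kept], kept


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    keys = rng.standard_normal((1024, 64))
    values = rng.standard_normal((1024, 64))
    recent_queries = rng.standard_normal((32, 64)) + 0.5
    k2, v2, idx = prune_kv_cache(keys, values, recent_queries, keep_ratio=0.25)
    print(k2.shape, v2.shape, idx[:5])
```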
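
Finally, thought-adaptive compression can be sketched as a policy that maps a per-thought importance score to a precision level or to eviction. The bit-width thresholds and the simple symmetric quantizer below are placeholders; ThinKV's actual policy and quantization scheme may differ.

```python
import numpy as np

def quantize(x: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric quantization to the given bit-width (dequantized back to float)."""
    qmax = 2 ** (bits - 1) - 1
    peak = np.abs(x).max()
    scale = peak / qmax if peak > 0 else 1.0
    return np.round(x / scale).clip(-qmax, qmax) * scale

def compress_by_thought(kv_segments, importances, evict_below=0.2):
    """Assign KV precision per reasoning segment ('thought') by importance.

    kv_segments: list of (keys, values) arrays, one per thought
    importances: list of scores in [0, 1], one per thought
    Returns compressed segments; the least important thoughts become None (evicted).
    Hedged sketch: the thresholds are illustrative, not ThinKV's policy.
    """
    out = []
    for (k, v), imp in zip(kv_segments, importances):
        if imp < evict_below:
            out.append(None)                                  # evict the whole thought
        elif imp < 0.5:
            out.append((quantize(k, 4), quantize(v, 4)))      # low precision
        elif imp < 0.8:
            out.append((quantize(k, 8), quantize(v, 8)))      # medium precision
        else:
            out.append((k, v))                                # keep full precision
    return out


if __name__ == "__main__":
    # Three synthetic thoughts of increasing importance.
    rng = np.random.default_rng(0)
    segs = [(rng.standard_normal((64, 32)), rng.standard_normal((64, 32))) for _ in range(3)]
    result = compress_by_thought(segs, [0.1, 0.6, 0.9])
    print(["evicted" if s is None else "kept" for s in result])
```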