Efficient Inference in Large Language Models

Research on large language models (LLMs) is increasingly focused on inference efficiency, particularly on optimizing Key-Value (KV) cache management and reducing computational demands. Techniques such as dynamic token pruning, expert-sharded KV storage, and semantic caching are being explored to improve performance and scalability, addressing the memory bottlenecks and latency that grow with context length and model size. Noteworthy papers in this area include SlimInfer, which proposes a dynamic fine-grained pruning mechanism to accelerate long-context inference; PiKV, a parallel and distributed KV cache serving framework tailored to Mixture-of-Experts (MoE) architectures; and LP-Spec, an architecture-dataflow co-design that leverages a hybrid, performance-enhanced LPDDR5 PIM architecture to accelerate LLM speculative inference on mobile devices. Other notable works include CATP, FIER, and RetroAttention, which introduce contextually adaptive token pruning, fine-grained KV cache retrieval, and retrospective sparse attention, respectively. A minimal sketch of the shared underlying idea, bounding KV memory by evicting low-importance cached tokens, follows below.
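The sketch below is a toy, per-layer KV cache that caps memory by evicting tokens with low accumulated attention mass. It is a hand-written illustration in plain NumPy, not the method of SlimInfer, CATP, or any other cited paper; the class name, the score-based eviction rule, and the 128-token budget are all illustrative assumptions.

```python
# Illustrative sketch only: a per-layer KV cache with score-based token pruning.
# The cited papers each define their own pruning/retrieval criteria; this toy
# version just shows how a fixed token budget bounds KV memory during decoding.
import numpy as np

class PrunedKVCache:
    def __init__(self, max_tokens: int, head_dim: int):
        self.max_tokens = max_tokens          # memory budget: tokens kept per layer
        self.keys = np.empty((0, head_dim))   # cached key vectors
        self.values = np.empty((0, head_dim)) # cached value vectors
        self.scores = np.empty(0)             # running importance per cached token

    def append(self, k: np.ndarray, v: np.ndarray) -> None:
        """Add the new token's key/value pair and prune if over budget."""
        self.keys = np.vstack([self.keys, k[None, :]])
        self.values = np.vstack([self.values, v[None, :]])
        self.scores = np.append(self.scores, 0.0)
        if len(self.scores) > self.max_tokens:
            keep = np.argsort(self.scores)[-self.max_tokens:]  # drop lowest-scoring tokens
            keep.sort()                                         # preserve positional order
            self.keys, self.values = self.keys[keep], self.values[keep]
            self.scores = self.scores[keep]

    def attend(self, q: np.ndarray) -> np.ndarray:
        """Attention over the (pruned) cache; attention mass updates importance."""
        logits = self.keys @ q / np.sqrt(q.shape[-1])
        weights = np.exp(logits - logits.max())
        weights /= weights.sum()
        self.scores += weights                 # tokens that keep receiving attention survive
        return weights @ self.values

# Usage: a decode loop with a hypothetical 128-token KV budget.
cache = PrunedKVCache(max_tokens=128, head_dim=64)
for _ in range(512):
    k, v, q = (np.random.randn(64) for _ in range(3))
    cache.append(k, v)
    _ = cache.attend(q)
print(cache.keys.shape)  # at most (128, 64), regardless of sequence length
```

The design choice illustrated here is the trade-off all of these approaches navigate: a smaller token budget cuts memory and latency but risks evicting context the model later needs, which is why the cited papers invest in smarter importance estimates and retrieval mechanisms than the simple accumulated-attention score used above.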

Sources

KV Cache Compression for Inference Efficiency in LLMs: A Review

SlimInfer: Accelerating Long-Context LLM Inference via Dynamic Token Pruning

PiKV: KV Cache Management System for Mixture of Experts

LP-Spec: Leveraging LPDDR PIM for Efficient LLM Mobile Speculative Inference with Architecture-Dataflow Co-Optimization

Semantic Caching for Low-Cost LLM Serving: From Offline Learning to Online Adaptation

CATP: Contextually Adaptive Token Pruning for Efficient and Enhanced Multimodal In-Context Learning

FIER: Fine-Grained and Efficient KV Cache Retrieval for Long-context LLM Inference

Retrospective Sparse Attention for Efficient Long-Context Generation
