Efficient Inference in Large Language Models

Research on large language models (LLMs) is increasingly focused on inference efficiency, particularly on optimizing Key-Value (KV) cache management and reducing computational demands. Techniques such as dynamic token pruning, expert-sharded KV storage, and semantic caching are being explored to improve performance and scalability, addressing the memory bottlenecks and latency that grow with context length and model size. Noteworthy papers in this area include SlimInfer, which proposes a dynamic fine-grained pruning mechanism to accelerate long-context inference; PiKV, a parallel and distributed KV cache serving framework tailored to Mixture-of-Experts (MoE) architectures; and LP-Spec, an architecture-dataflow co-design that leverages a hybrid, performance-enhanced LPDDR5 PIM architecture to accelerate LLM speculative inference on mobile devices. Other notable works include CATP, FIER, and RetroAttention, which introduce contextually adaptive token pruning, fine-grained KV cache retrieval, and retrospective sparse attention, respectively. A minimal sketch of the shared underlying idea, bounding KV memory by evicting low-importance cached tokens, follows below.
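The sketch below is a toy, per-layer KV cache that caps memory by evicting tokens with low accumulated attention mass. It is a hand-written illustration in plain NumPy, not the method of SlimInfer, CATP, or any other cited paper; the class name, the score-based eviction rule, and the 128-token budget are all illustrative assumptions.

```python
# Illustrative sketch only: a per-layer KV cache with score-based token pruning.
# The cited papers each define their own pruning/retrieval criteria; this toy
# version just shows how a fixed token budget bounds KV memory during decoding.
import numpy as np

class PrunedKVCache:
    def __init__(self, max_tokens: int, head_dim: int):
        self.max_tokens = max_tokens          # memory budget: tokens kept per layer
        self.keys = np.empty((0, head_dim))   # cached key vectors
        self.values = np.empty((0, head_dim)) # cached value vectors
        self.scores = np.empty(0)             # running importance per cached token

    def append(self, k: np.ndarray, v: np.ndarray) -> None:
        """Add the new token's key/value pair and prune if over budget."""
        self.keys = np.vstack([self.keys, k[None, :]])
        self.values = np.vstack([self.values, v[None, :]])
        self.scores = np.append(self.scores, 0.0)
        if len(self.scores) > self.max_tokens:
            keep = np.argsort(self.scores)[-self.max_tokens:]  # drop lowest-scoring tokens
            keep.sort()                                         # preserve positional order
            self.keys, self.values = self.keys[keep], self.values[keep]
            self.scores = self.scores[keep]

    def attend(self, q: np.ndarray) -> np.ndarray:
        """Attention over the (pruned) cache; attention mass updates importance."""
        logits = self.keys @ q / np.sqrt(q.shape[-1])
        weights = np.exp(logits - logits.max())
        weights /= weights.sum()
        self.scores += weights                 # tokens that keep receiving attention survive
        return weights @ self.values

# Usage: a decode loop with a hypothetical 128-token KV budget.
cache = PrunedKVCache(max_tokens=128, head_dim=64)
for _ in range(512):
    k, v, q = (np.random.randn(64) for _ in range(3))
    cache.append(k, v)
    _ = cache.attend(q)
print(cache.keys.shape)  # at most (128, 64), regardless of sequence length
```

The design choice illustrated here is the trade-off all of these approaches navigate: a smaller token budget cuts memory and latency but risks evicting context the model later needs, which is why the cited papers invest in smarter importance estimates and retrieval mechanisms than the simple accumulated-attention score used above.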

Sources

KV Cache Compression for Inference Efficiency in LLMs: A Review

SlimInfer: Accelerating Long-Context LLM Inference via Dynamic Token Pruning

PiKV: KV Cache Management System for Mixture of Experts

LP-Spec: Leveraging LPDDR PIM for Efficient LLM Mobile Speculative Inference with Architecture-Dataflow Co-Optimization

Semantic Caching for Low-Cost LLM Serving: From Offline Learning to Online Adaptation

CATP: Contextually Adaptive Token Pruning for Efficient and Enhanced Multimodal In-Context Learning

FIER: Fine-Grained and Efficient KV Cache Retrieval for Long-context LLM Inference

Retrospective Sparse Attention for Efficient Long-Context Generation
