Advancements in Large Language Model Serving and Inference

The field of large language models (LLMs) is evolving rapidly, with a strong focus on improving serving and inference efficiency. Recent developments center on optimizing KV-cache utilization, dynamically reconfiguring inference pipelines, and allocating heterogeneous resources, aiming to address highly variable request patterns, severe resource fragmentation, and the need for high throughput and resource efficiency in AI infrastructure. Notable advancements include the integration of DPUs with cloud gateways, new KV caching layers, and dynamic orchestration frameworks for balancing disaggregated LLM serving. These efforts have produced significant performance gains, with systems reporting up to 15x higher throughput and up to 78.4% lower total processing time.
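
Several of these systems build on disaggregated serving, where the compute-bound prefill phase and the memory-bound decode phase run on separate worker pools and an orchestrator balances load between them. The sketch below is only a minimal illustration of that idea under assumed names (Worker, Request, DisaggregatedRouter, and the queued-token load proxy are all hypothetical); it is not the API or placement policy of BanaServe, FlexPipe, or any other system listed here.

```python
# Illustrative sketch only: a toy orchestrator for disaggregated LLM serving.
# All names and the load metric are assumptions for this example, not the
# design of any paper cited in this digest.
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    queued_tokens: int = 0  # crude proxy for current load


@dataclass
class Request:
    prompt_tokens: int
    max_new_tokens: int


class DisaggregatedRouter:
    """Routes the prefill and decode phases of each request to separate pools."""

    def __init__(self, prefill_pool, decode_pool):
        self.prefill_pool = prefill_pool
        self.decode_pool = decode_pool

    def _least_loaded(self, pool):
        return min(pool, key=lambda w: w.queued_tokens)

    def route(self, req: Request):
        # Prefill cost scales with prompt length, decode cost with generated
        # tokens, so the two phases are placed independently.
        prefill = self._least_loaded(self.prefill_pool)
        decode = self._least_loaded(self.decode_pool)
        prefill.queued_tokens += req.prompt_tokens
        decode.queued_tokens += req.max_new_tokens
        return prefill.name, decode.name


if __name__ == "__main__":
    router = DisaggregatedRouter(
        prefill_pool=[Worker("prefill-0"), Worker("prefill-1")],
        decode_pool=[Worker("decode-0"), Worker("decode-1")],
    )
    print(router.route(Request(prompt_tokens=2048, max_new_tokens=256)))
```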

Noteworthy papers include:

LMCache presents an efficient KV caching layer for enterprise-scale LLM inference, achieving up to 15x improvement in throughput.

Zephyrus introduces a DPU-augmented hierarchical co-offloading architecture for scaling gateways beyond the petabit era, outperforming existing systems with 33% higher throughput and 21% lower power consumption.

FlexPipe dynamically reconfigures pipeline architectures in flight for efficient LLM serving on fragmented serverless clusters, achieving up to 8.5x better resource efficiency and 38.3% lower latency.

KVCOMM enables efficient prefilling in multi-agent inference by reusing KV-caches across contexts, achieving over a 70% reuse rate and up to 7.8x speedup.

BanaServe introduces a dynamic orchestration framework for balancing disaggregated LLM serving, achieving 1.2x-3.9x higher throughput and 3.9%-78.4% lower total processing time.

xLLM presents an intelligent and efficient LLM inference framework, delivering significantly superior performance and resource efficiency, with up to 1.7x and 2.2x higher throughput than existing systems.
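
A recurring mechanism behind several of these gains is prefix-based KV-cache reuse: when a new prompt shares a prefix with an earlier request (a system prompt, shared agent context, and so on), the already-computed KV entries can be fetched instead of recomputed, so prefill only runs on the new suffix. The snippet below is a minimal, generic illustration of that idea; the chunk size, hashing scheme, and class names are assumptions made for this example and do not describe the actual designs of LMCache or KVCOMM.

```python
# Illustrative sketch only: a toy prefix-keyed KV-cache store. Chunk size,
# hashing, and names are assumptions for this example, not any paper's design.
import hashlib

CHUNK = 256  # tokens per cached chunk (arbitrary choice for the example)


def _chunk_key(prefix_tokens):
    """Hash a token prefix so identical prefixes map to the same cache entry."""
    return hashlib.sha256(str(prefix_tokens).encode("utf-8")).hexdigest()


class PrefixKVCache:
    def __init__(self):
        self._store = {}  # chunk key -> opaque KV data (strings here)

    def insert(self, tokens, kv_chunks):
        """Store KV data for every full CHUNK-sized prefix of the prompt."""
        for i, end in enumerate(range(CHUNK, len(tokens) + 1, CHUNK)):
            self._store[_chunk_key(tokens[:end])] = kv_chunks[i]

    def longest_prefix_hit(self, tokens):
        """Return how many leading tokens already have cached KV entries."""
        hit = 0
        for end in range(CHUNK, len(tokens) + 1, CHUNK):
            if _chunk_key(tokens[:end]) in self._store:
                hit = end
            else:
                break
        return hit


if __name__ == "__main__":
    cache = PrefixKVCache()
    shared_prefix = list(range(512))              # e.g. a long system prompt
    cache.insert(shared_prefix, ["kv0", "kv1"])   # two 256-token chunks
    new_prompt = shared_prefix + list(range(512, 600))
    reused = cache.longest_prefix_hit(new_prompt)
    print(f"prefill can skip {reused} of {len(new_prompt)} tokens")
```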

Sources

LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference

Zephyrus: Scaling Gateways Beyond the Petabit-Era with DPU-Augmented Hierarchical Co-Offloading

An Explorative Study on Distributed Computing Techniques in Training and Inference of Large Language Models

FlexPipe: Adapting Dynamic LLM Serving Through Inflight Pipeline Refactoring in Fragmented Serverless Clusters

GeoPipe: a Geo-distributed LLM Training Framework with enhanced Pipeline Parallelism in a Lossless RDMA-enabled Datacenter Optical Transport Network

KVCOMM: Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems

BanaServe: Unified KV Cache and Dynamic Module Migration for Balancing Disaggregated LLM Serving in AI Infrastructure

Efficiently Executing High-throughput Lightweight LLM Inference Applications on Heterogeneous Opportunistic GPU Clusters with Pervasive Context Management

Cortex: Workflow-Aware Resource Pooling and Scheduling for Agentic Serving

xLLM Technical Report
