Efficient Serving of Large Language Models

The field of large language models (LLMs) is moving toward more efficient serving strategies that reduce computational cost while improving user experience. Researchers are exploring approaches such as cognitive load-aware streaming, proactive intra- and inter-instance orchestration, dynamic spatial-temporal GPU orchestration, and SLO-aware scheduling. These advances target the mismatch between the compute budget spent on decoding and the speed at which humans actually read, as well as underutilized GPU resources in current serving stacks. Noteworthy papers include EcoServe, which enables cost-effective LLM serving on clusters with commodity interconnects; Bullet, which boosts GPU utilization via dynamic spatial-temporal orchestration; Tempo, an SLO-aware scheduler that maximizes service gain across diverse LLM workloads; and semi-PD, a serving system built on phase-wise disaggregated computation and unified storage.
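The cognitive load-aware streaming idea hinges on pacing token delivery to the reader rather than to the decoder. The minimal Python sketch below illustrates only that pacing step; the constants, the `paced_stream` function, and the throttling policy are illustrative assumptions for exposition, not the mechanism proposed in any of the papers listed under Sources.

```python
import time

# Illustrative assumptions (not from the papers): ~4 words/sec reading speed
# and ~1.3 subword tokens per English word.
READER_WORDS_PER_SEC = 4.0
TOKENS_PER_WORD = 1.3

def paced_stream(token_iter, words_per_sec=READER_WORDS_PER_SEC):
    """Yield tokens from token_iter no faster than the assumed reading speed.

    The decoder may produce tokens much faster; throttling emission gives
    the serving engine room to spend the freed compute on other requests.
    """
    min_interval = 1.0 / (words_per_sec * TOKENS_PER_WORD)  # seconds per token
    last_emit = None
    for tok in token_iter:
        if last_emit is not None:
            wait = min_interval - (time.monotonic() - last_emit)
            if wait > 0:
                time.sleep(wait)
        yield tok
        last_emit = time.monotonic()

if __name__ == "__main__":
    # Toy usage: stream a short, pre-tokenized response at reading pace.
    for tok in paced_stream(iter("pacing token emission to reader speed".split())):
        print(tok, end=" ", flush=True)
    print()
```

In a real serving stack the freed budget would be reclaimed by the scheduler (for example, by batching or deprioritizing decode steps for paced requests); the sketch only shows the client-facing throttle.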

Sources

Streaming, Fast and Slow: Cognitive Load-Aware Streaming for Efficient LLM Serving

EcoServe: Enabling Cost-effective LLM Serving with Proactive Intra- and Inter-Instance Orchestration

Bullet: Boosting GPU Utilization for LLM Serving via Dynamic Spatial-Temporal Orchestration

Taming the Titans: A Survey of Efficient LLM Inference Serving

semi-PD: Towards Efficient LLM Serving via Phase-Wise Disaggregated Computation and Unified Storage

Tempo: Application-aware LLM Serving with Mixed SLO Requirements

Thoughtful, Confused, or Untrustworthy: How Text Presentation Influences Perceptions of AI Writing Tools

GaLore 2: Large-Scale LLM Pre-Training by Gradient Low-Rank Projection

Ascendra: Dynamic Request Prioritization for Efficient LLM Serving

Scaling On-Device GPU Inference for Large Generative Models
