Efficient Serving of Large Language Models

The field of large language models (LLMs) is moving toward more efficient serving strategies that reduce computational cost while improving user experience. Researchers are exploring approaches such as cognitive load-aware streaming, proactive intra- and inter-instance orchestration, dynamic spatial-temporal GPU orchestration, and SLO-aware scheduling. These advances target the mismatch between the compute budget spent on decoding and the speed at which humans actually read, as well as underutilized GPU resources in current serving stacks. Noteworthy papers include EcoServe, which enables cost-effective LLM serving on clusters with commodity interconnects; Bullet, which boosts GPU utilization via dynamic spatial-temporal orchestration; Tempo, an SLO-aware scheduler that maximizes service gain across diverse LLM workloads; and semi-PD, a serving system built on phase-wise disaggregated computation and unified storage.
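The cognitive load-aware streaming idea hinges on pacing token delivery to the reader rather than to the decoder. The minimal Python sketch below illustrates only that pacing step; the constants, the `paced_stream` function, and the throttling policy are illustrative assumptions for exposition, not the mechanism proposed in any of the papers listed under Sources.

```python
import time

# Illustrative assumptions (not from the papers): ~4 words/sec reading speed
# and ~1.3 subword tokens per English word.
READER_WORDS_PER_SEC = 4.0
TOKENS_PER_WORD = 1.3

def paced_stream(token_iter, words_per_sec=READER_WORDS_PER_SEC):
    """Yield tokens from token_iter no faster than the assumed reading speed.

    The decoder may produce tokens much faster; throttling emission gives
    the serving engine room to spend the freed compute on other requests.
    """
    min_interval = 1.0 / (words_per_sec * TOKENS_PER_WORD)  # seconds per token
    last_emit = None
    for tok in token_iter:
        if last_emit is not None:
            wait = min_interval - (time.monotonic() - last_emit)
            if wait > 0:
                time.sleep(wait)
        yield tok
        last_emit = time.monotonic()

if __name__ == "__main__":
    # Toy usage: stream a short, pre-tokenized response at reading pace.
    for tok in paced_stream(iter("pacing token emission to reader speed".split())):
        print(tok, end=" ", flush=True)
    print()
```

In a real serving stack the freed budget would be reclaimed by the scheduler (for example, by batching or deprioritizing decode steps for paced requests); the sketch only shows the client-facing throttle.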

Sources

Streaming, Fast and Slow: Cognitive Load-Aware Streaming for Efficient LLM Serving

EcoServe: Enabling Cost-effective LLM Serving with Proactive Intra- and Inter-Instance Orchestration

Bullet: Boosting GPU Utilization for LLM Serving via Dynamic Spatial-Temporal Orchestration

Taming the Titans: A Survey of Efficient LLM Inference Serving

semi-PD: Towards Efficient LLM Serving via Phase-Wise Disaggregated Computation and Unified Storage

Tempo: Application-aware LLM Serving with Mixed SLO Requirements

Thoughtful, Confused, or Untrustworthy: How Text Presentation Influences Perceptions of AI Writing Tools

GaLore 2: Large-Scale LLM Pre-Training by Gradient Low-Rank Projection

Ascendra: Dynamic Request Prioritization for Efficient LLM Serving

Scaling On-Device GPU Inference for Large Generative Models
