Optimizing Large Language Model Serving

The field of Large Language Model (LLM) serving is advancing rapidly, with a focus on efficiency, scalability, and sustainability. Recent work centers on optimizing parallelization strategies, memory management, and energy use. Researchers are exploring approaches that co-optimize parallelism degrees and per-operator sharding dimensions, and building systems that serve LLMs efficiently on heterogeneous GPU clusters; minimal, illustrative sketches of the sharding search space and of SLO-aware frequency scaling follow the paper list below. Notable papers in this area include:

  • Learning to Shard, which uses reinforcement learning to co-optimize parallelism degrees and per-operator sharding dimensions, achieving up to a 3.5x throughput improvement over metaheuristic baselines.
  • Halo, which introduces batch query processing and optimization for agentic LLM workflows, achieving up to an 18.6x speedup for batch inference.
  • VoltanaLLM, which demonstrates up to 36.3% energy savings while maintaining a near-perfect SLO attainment rate.
  • FineServe, which proposes a precision-aware KV slab and two-level scheduling framework for mixed-precision LLM serving, achieving up to 2.2x higher SLO attainment and 1.8x higher token generation throughput.
  • MaaSO, which improves the SLO satisfaction ratio by 15 to 30% and reduces response latency by 40 to 60% compared to existing approaches.
  • Hetis, which employs a fine-grained and dynamic parallelism design to improve serving throughput by up to 2.25x and reduce latency by 1.49x compared to existing systems.
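
To make the co-optimization problem concrete, the sketch below enumerates a toy search space of tensor-parallel and pipeline-parallel degrees under an invented per-operator cost model. It is a brute-force illustration of what is being searched, not the RL policy from Learning to Shard; the GPU count, operator list, and cost formulas are all assumptions made up for this example.

    """Toy sketch of the search space behind parallelism/sharding co-optimization.

    NOT the RL method from "Learning to Shard": this is a brute-force enumeration
    over hypothetical (tensor-parallel, pipeline-parallel) degrees with a made-up
    per-operator cost model, purely to illustrate what is being co-optimized.
    """
    from itertools import product

    NUM_GPUS = 8  # assumed cluster size (illustrative)

    # Hypothetical per-layer operators: (flops, activation bytes) -- invented numbers.
    OPERATORS = {
        "attention": (4.0e12, 2.0e9),
        "mlp":       (8.0e12, 4.0e9),
    }

    def op_cost(flops, act_bytes, tp, pp):
        """Rough cost model: compute shrinks with tensor parallelism, but
        all-reduce traffic grows with it; pipeline parallelism adds a crude
        per-stage bubble penalty. Purely illustrative."""
        compute = flops / tp
        comm = act_bytes * (tp - 1) / tp      # all-reduce volume grows with tp
        bubble = 0.05 * compute * (pp - 1)    # crude pipeline-bubble penalty
        return compute + 5.0 * comm + bubble  # arbitrary comm-vs-compute weighting

    def plan_cost(tp, pp):
        return sum(op_cost(f, b, tp, pp) for f, b in OPERATORS.values())

    # Enumerate degree assignments that fit the GPU budget and keep the cheapest.
    candidates = [
        (tp, pp)
        for tp, pp in product([1, 2, 4, 8], repeat=2)
        if tp * pp <= NUM_GPUS
    ]
    best = min(candidates, key=lambda c: plan_cost(*c))
    print(f"best (tp, pp) under this toy model: {best}")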

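Similarly, feedback-driven frequency control for energy-aware serving can be pictured as a small control loop that lowers the GPU clock while the latency SLO has headroom and raises it when the SLO is threatened. The sketch below is a simplified stand-in, not VoltanaLLM's actual controller; the clock list, SLO target, and latency model are invented for illustration.

    """Toy feedback loop for SLO-aware GPU frequency scaling (illustrative only)."""
    import random

    SLO_MS = 200.0                        # assumed per-request latency target
    FREQS_MHZ = [900, 1100, 1300, 1500]   # hypothetical supported GPU clocks

    def observed_latency_ms(freq_mhz):
        """Stand-in for a real measurement: latency roughly inversely
        proportional to clock speed, plus noise."""
        return 150.0 * (1500 / freq_mhz) + random.uniform(-10, 10)

    def control_step(freq_idx, latency_ms, headroom=0.85):
        """Simple hysteresis controller: raise the clock when the SLO is
        threatened, lower it when latency is comfortably below the SLO."""
        if latency_ms > SLO_MS and freq_idx < len(FREQS_MHZ) - 1:
            return freq_idx + 1
        if latency_ms < headroom * SLO_MS and freq_idx > 0:
            return freq_idx - 1
        return freq_idx

    freq_idx = len(FREQS_MHZ) - 1  # start at the highest clock
    for step in range(20):
        lat = observed_latency_ms(FREQS_MHZ[freq_idx])
        print(f"step {step:2d}: {FREQS_MHZ[freq_idx]} MHz -> latency {lat:.0f} ms")
        freq_idx = control_step(freq_idx, lat)
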
Sources

Learning to Shard: RL for Co-optimizing the Parallelism Degrees and Per-operator Sharding Dimensions in Distributed LLM Inference

Batch Query Processing and Optimization for Agentic Workflows

VoltanaLLM: Feedback-Driven Frequency Control and State-Space Routing for Energy-Efficient LLM Serving

FineServe: Precision-Aware KV Slab and Two-Level Scheduling for Heterogeneous Precision LLM Serving

MaaSO: SLO-aware Orchestration of Heterogeneous Model Instances for MaaS

Hetis: Serving LLMs in Heterogeneous GPU Clusters with Fine-grained and Dynamic Parallelism