The field of Large Language Model (LLM) serving is advancing rapidly, with a focus on improving efficiency, scalability, and sustainability. Recent developments have centered on optimizing parallelization strategies, memory management, and energy efficiency. Researchers are exploring approaches that co-optimize parallelism degrees and per-operator sharding dimensions (a simplified configuration-search sketch follows the list below), as well as systems that serve LLMs efficiently on heterogeneous GPU clusters. Notable papers in this area include:
- Learn to Shard, which achieves up to 3.5x throughput improvement over metaheuristic baselines.
- Halo, which introduces batch query processing and optimization for agentic LLM workflows, achieving up to 18.6x speedup for batch inference.
- VoltanaLLM, which demonstrates up to 36.3% energy savings while maintaining near-perfect SLO attainment rate.
- FineServe, which proposes a precision-aware KV slab and two-level scheduling framework for mixed-precision LLM serving, achieving up to 2.2x higher SLO attainment and 1.8x higher token generation throughput.
- MaaSO, which improves the SLO satisfaction ratio by 15 to 30% and reduces response latency by 40 to 60% compared to existing approaches.
- Hetis, which employs a fine-grained and dynamic parallelism design to improve serving throughput by up to 2.25x and reduce latency by 1.49x compared to existing systems.
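To make the co-optimization theme above concrete, here is a minimal, purely illustrative sketch of searching over tensor- and pipeline-parallel degrees under a per-GPU memory budget. The `ParallelPlan`, `fits_in_memory`, `estimated_throughput`, and `best_plan` names, the cost model, and all numeric constants are assumptions for illustration only; they do not reflect the actual method of Learn to Shard or any other paper listed here.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical cluster and model parameters; all numbers are illustrative.
NUM_GPUS = 8
GPU_MEMORY_GB = 80.0
MODEL_WEIGHTS_GB = 140.0          # e.g. a ~70B-parameter model in fp16
ACTIVATION_GB_PER_REPLICA = 12.0

@dataclass(frozen=True)
class ParallelPlan:
    tensor_parallel: int    # shards each operator's weights across GPUs
    pipeline_parallel: int  # splits layers into sequential stages

def fits_in_memory(plan: ParallelPlan) -> bool:
    """Rough per-GPU memory check: weights are split across all GPUs in the
    plan; activations are split across pipeline stages only."""
    gpus = plan.tensor_parallel * plan.pipeline_parallel
    weights_per_gpu = MODEL_WEIGHTS_GB / gpus
    activations_per_gpu = ACTIVATION_GB_PER_REPLICA / plan.pipeline_parallel
    return weights_per_gpu + activations_per_gpu <= GPU_MEMORY_GB

def estimated_throughput(plan: ParallelPlan) -> float:
    """Toy cost model: tensor parallelism adds all-reduce overhead,
    pipeline parallelism adds bubble overhead."""
    comm_penalty = 1.0 + 0.15 * (plan.tensor_parallel - 1)
    bubble_penalty = 1.0 + 0.10 * (plan.pipeline_parallel - 1)
    gpus = plan.tensor_parallel * plan.pipeline_parallel
    return gpus / (comm_penalty * bubble_penalty)

def best_plan() -> ParallelPlan:
    """Enumerate (TP, PP) splits of the cluster and keep the memory-feasible
    plan with the highest estimated throughput."""
    candidates = [
        ParallelPlan(tp, pp)
        for tp, pp in product([1, 2, 4, 8], repeat=2)
        if tp * pp <= NUM_GPUS and fits_in_memory(ParallelPlan(tp, pp))
    ]
    return max(candidates, key=estimated_throughput)

if __name__ == "__main__":
    plan = best_plan()
    print(f"chosen plan: TP={plan.tensor_parallel}, PP={plan.pipeline_parallel}, "
          f"estimated relative throughput={estimated_throughput(plan):.2f}")
```

Real systems in this space replace the brute-force loop and toy cost model with learned or analytical performance models and search per-operator sharding choices as well, but the basic structure of the decision (enumerate feasible parallel configurations, score them, pick the best) is the same.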