The field of Large Language Model (LLM) serving is advancing rapidly, with a focus on improving efficiency, scalability, and sustainability. Recent developments have centered on optimizing parallelization strategies, memory management, and energy efficiency. Researchers are exploring approaches that co-optimize parallelism degrees and per-operator sharding dimensions (a simplified configuration-search sketch follows the list below), as well as systems that serve LLMs efficiently on heterogeneous GPU clusters. Notable papers in this area include:
- Learn to Shard, which achieves up to 3.5x throughput improvement over metaheuristic baselines.
- Halo, which introduces batch query processing and optimization for agentic LLM workflows, achieving up to 18.6x speedup for batch inference.
- VoltanaLLM, which demonstrates up to 36.3% energy savings while maintaining near-perfect SLO attainment rate.
- FineServe, which proposes a precision-aware KV slab and two-level scheduling framework for mixed-precision LLM serving, achieving up to 2.2x higher SLO attainment and 1.8x higher token generation throughput.
- MaaSO, which improves the SLO satisfaction ratio by 15 to 30% and reduces response latency by 40 to 60% compared to existing approaches.
- Hetis, which employs a fine-grained and dynamic parallelism design to improve serving throughput by up to 2.25x and reduce latency by 1.49x compared to existing systems.
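To make the co-optimization theme above concrete, here is a minimal, purely illustrative sketch of searching over tensor- and pipeline-parallel degrees under a per-GPU memory budget. The `ParallelPlan`, `fits_in_memory`, `estimated_throughput`, and `best_plan` names, the cost model, and all numeric constants are assumptions for illustration only; they do not reflect the actual method of Learn to Shard or any other paper listed here.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical cluster and model parameters; all numbers are illustrative.
NUM_GPUS = 8
GPU_MEMORY_GB = 80.0
MODEL_WEIGHTS_GB = 140.0          # e.g. a ~70B-parameter model in fp16
ACTIVATION_GB_PER_REPLICA = 12.0

@dataclass(frozen=True)
class ParallelPlan:
    tensor_parallel: int    # shards each operator's weights across GPUs
    pipeline_parallel: int  # splits layers into sequential stages

def fits_in_memory(plan: ParallelPlan) -> bool:
    """Rough per-GPU memory check: weights are split across all GPUs in the
    plan; activations are split across pipeline stages only."""
    gpus = plan.tensor_parallel * plan.pipeline_parallel
    weights_per_gpu = MODEL_WEIGHTS_GB / gpus
    activations_per_gpu = ACTIVATION_GB_PER_REPLICA / plan.pipeline_parallel
    return weights_per_gpu + activations_per_gpu <= GPU_MEMORY_GB

def estimated_throughput(plan: ParallelPlan) -> float:
    """Toy cost model: tensor parallelism adds all-reduce overhead,
    pipeline parallelism adds bubble overhead."""
    comm_penalty = 1.0 + 0.15 * (plan.tensor_parallel - 1)
    bubble_penalty = 1.0 + 0.10 * (plan.pipeline_parallel - 1)
    gpus = plan.tensor_parallel * plan.pipeline_parallel
    return gpus / (comm_penalty * bubble_penalty)

def best_plan() -> ParallelPlan:
    """Enumerate (TP, PP) splits of the cluster and keep the memory-feasible
    plan with the highest estimated throughput."""
    candidates = [
        ParallelPlan(tp, pp)
        for tp, pp in product([1, 2, 4, 8], repeat=2)
        if tp * pp <= NUM_GPUS and fits_in_memory(ParallelPlan(tp, pp))
    ]
    return max(candidates, key=estimated_throughput)

if __name__ == "__main__":
    plan = best_plan()
    print(f"chosen plan: TP={plan.tensor_parallel}, PP={plan.pipeline_parallel}, "
          f"estimated relative throughput={estimated_throughput(plan):.2f}")
```

Real systems in this space replace the brute-force loop and toy cost model with learned or analytical performance models and search per-operator sharding choices as well, but the basic structure of the decision (enumerate feasible parallel configurations, score them, pick the best) is the same.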