Optimizing Large Language Model Serving

The field of Large Language Model (LLM) serving is advancing rapidly, with a focus on improving efficiency, scalability, and fairness. Recent work centers on optimizing scheduling, scaling, and resource allocation to meet the diverse requirements of LLM workloads, and innovations in both algorithmic and system-level design have enabled significant gains in performance, energy efficiency, and cost-effectiveness. Notably, researchers have proposed frameworks and techniques for proactive SLO compliance, dynamic frequency scaling, and holistic fair scheduling, pointing toward more efficient, responsive, and fault-tolerant serving systems. Noteworthy papers include the following (illustrative sketches of several of these ideas appear after the list):

HyperFlexis, a unified LLM serving system that integrates algorithmic and system-level innovations to jointly optimize scheduling and scaling under multiple SLOs.

GreenLLM, an SLO-aware serving framework that minimizes GPU energy by explicitly separating prefill and decode control.

Equinox, a holistic fair scheduler built on a dual-counter framework that separates the user and operator perspectives.

Taming the Chaos, which introduces HeteroScale, a coordinated autoscaling framework addressing the core challenges of prefill/decode (P/D) disaggregated serving.

MERIT, an optimizer that leverages the max-norm to calculate the trust ratio, constraining the maximum attention logit more effectively.
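
To make GreenLLM's stated idea concrete, here is a minimal sketch of per-phase, SLO-aware frequency scaling: prefill and decode each get their own control loop, since prefill is judged by time-to-first-token and decode by time-between-tokens. The clock table, SLO targets, and the set_gpu_clock_mhz stub are illustrative assumptions, not the paper's actual algorithm or API.

```python
# Hypothetical sketch of SLO-aware, per-phase GPU frequency scaling in the
# spirit of GreenLLM. Clock steps, thresholds, and the stub below are
# illustrative assumptions, not the paper's interface.

CLOCKS_MHZ = [900, 1100, 1300, 1500]  # assumed available clock steps

def set_gpu_clock_mhz(pool: str, mhz: int) -> None:
    # Stub: a real controller would lock clocks on the GPUs of `pool`
    # (e.g., through NVML); here we just record the decision.
    print(f"[{pool}] GPU clock -> {mhz} MHz")

def pick_clock(latency_s: float, slo_s: float, current_mhz: int) -> int:
    """Step the clock up when the SLO is at risk, down when there is slack."""
    i = CLOCKS_MHZ.index(current_mhz)
    if latency_s > 0.9 * slo_s:    # close to violating the SLO: speed up
        return CLOCKS_MHZ[min(i + 1, len(CLOCKS_MHZ) - 1)]
    if latency_s < 0.6 * slo_s:    # ample slack: slow down to save energy
        return CLOCKS_MHZ[max(i - 1, 0)]
    return current_mhz

# Prefill is governed by time-to-first-token (TTFT) and decode by
# time-between-tokens (TBT), so each phase is controlled independently.
prefill_mhz = pick_clock(latency_s=0.48, slo_s=0.50, current_mhz=1300)
decode_mhz = pick_clock(latency_s=0.02, slo_s=0.05, current_mhz=1300)
set_gpu_clock_mhz("prefill", prefill_mhz)  # -> 1500 MHz (near the SLO)
set_gpu_clock_mhz("decode", decode_mhz)    # -> 1100 MHz (plenty of slack)
```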
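
Equinox's summary mentions a dual-counter framework separating user and operator perspectives. The sketch below shows one plausible reading: each client carries a user-side counter (e.g., tokens delivered) and an operator-side counter (e.g., GPU-seconds consumed), and the scheduler serves the backlogged client with the least blended service. The counters and the blending rule are assumptions, not the paper's actual accounting.

```python
# Hypothetical dual-counter fair scheduler in the spirit of Equinox. The
# counter definitions and blending rule are illustrative assumptions.
from dataclasses import dataclass, field
from collections import deque

@dataclass
class Client:
    name: str
    user_counter: float = 0.0      # user perspective, e.g. tokens delivered
    operator_counter: float = 0.0  # operator perspective, e.g. GPU-seconds
    queue: deque = field(default_factory=deque)

def service(c: Client, alpha: float = 0.5) -> float:
    # Blend both perspectives into one "service received" score.
    return alpha * c.user_counter + (1 - alpha) * c.operator_counter

def pick_next(clients):
    backlogged = [c for c in clients if c.queue]
    # Least-served-first over the blended counters approximates fairness
    # from the user's and the operator's points of view simultaneously.
    return min(backlogged, key=service, default=None)

clients = [Client("A"), Client("B")]
clients[0].queue.extend(["a-req1", "a-req2"])
clients[1].queue.append("b-req1")

winner = pick_next(clients)
request = winner.queue.popleft()
winner.user_counter += 128      # tokens generated (assumed measurement)
winner.operator_counter += 0.8  # GPU-seconds consumed (assumed measurement)
print(f"served {request} from client {winner.name}")
```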
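
The coordination problem HeteroScale targets is that in P/D disaggregated serving, the prefill and decode pools must scale together rather than from independent per-pool metrics. The following toy policy illustrates only that idea: size the decode pool from a shared demand signal, then derive the prefill pool from a fixed ratio. The metric choice and ratio rule are assumptions, not the paper's mechanism.

```python
# Hypothetical sketch of coordinated P/D autoscaling in the spirit of
# HeteroScale: both pools are driven by one shared signal so the two
# tiers never drift apart. All numbers and the policy are assumptions.
import math

def coordinated_scale(decode_tok_per_s: float,
                      tok_per_s_per_decode_gpu: float,
                      prefill_per_decode: float):
    # Size the decode pool from observed token demand, then derive the
    # prefill pool from a fixed P:D ratio instead of a separate metric.
    decode_gpus = max(1, math.ceil(decode_tok_per_s / tok_per_s_per_decode_gpu))
    prefill_gpus = max(1, math.ceil(decode_gpus * prefill_per_decode))
    return prefill_gpus, decode_gpus

p, d = coordinated_scale(decode_tok_per_s=50_000,
                         tok_per_s_per_decode_gpu=2_500,
                         prefill_per_decode=0.5)
print(f"prefill GPUs: {p}, decode GPUs: {d}")  # -> prefill 10, decode 20
```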
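
For MERIT, the summary states that the trust ratio is computed with the max-norm. As context, LAMB-style optimizers scale each layer's update by a trust ratio of weight norm over update norm; the sketch below swaps in the max-norm (L-infinity), which bounds the largest entry the update can move and hence, per the summary, the maximum attention logit. The paper's exact maximum-normalized element-wise formulation may differ; this is only a sketch of the norm substitution.

```python
# Hypothetical max-norm trust ratio in the spirit of MERIT: a LAMB-like
# layer-wise rescaling, but with the L-infinity norm in place of L2.
# The paper's exact element-wise formulation may differ.
import torch

def max_norm_trust_ratio(weight: torch.Tensor, update: torch.Tensor) -> float:
    w = weight.abs().max()   # max-norm of the layer's weights
    u = update.abs().max()   # max-norm of the proposed update
    return (w / u).item() if u > 0 else 1.0

def apply_update(weight: torch.Tensor, update: torch.Tensor, lr: float) -> None:
    # Rescale the step so its largest element is bounded relative to the
    # largest weight, which in turn constrains attention-logit growth.
    ratio = max_norm_trust_ratio(weight, update)
    weight.add_(update, alpha=-lr * ratio)

w = torch.randn(256, 256)
g = torch.randn(256, 256) * 0.01   # stand-in for an Adam-style update
apply_update(w, g, lr=1e-3)
```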

Sources

HyperFlexis: Joint Design of Algorithms and Systems for Multi-SLO Serving and Fast Scaling

GreenLLM: SLO-Aware Dynamic Frequency Scaling for Energy-Efficient LLM Serving

Equinox: Holistic Fair Scheduling in Serving Large Language Models

Scalable Engine and the Performance of Different LLM Models in a SLURM based HPC architecture

AdLoCo: adaptive batching significantly improves communications efficiency and convergence for Large Language Models

Managing Multi Instance GPUs for High Throughput and Energy Savings

CARMA: Collocation-Aware Resource Manager with GPU Memory Estimator

Taming the Chaos: Coordinated Autoscaling for Heterogeneous and Disaggregated LLM Inference

Predictable LLM Serving on GPU Clusters

MERIT: Maximum-normalized Element-wise Ratio for Language Model Large-batch Training
