Sustainable and Efficient Large Language Model Serving

The field of large language model (LLM) serving is moving toward more sustainable and efficient solutions. Researchers are exploring approaches such as adaptive cache management and dynamic precision adaptation, which aim to cut carbon emissions while maintaining or improving serving performance.

Notable papers in this area include EmbAdvisor, an adaptive cache manager that reduces average carbon emissions by 9.5%, and NestedFP, which enables seamless FP8 and FP16 inference from a single 16-bit model representation and delivers up to 1.55x higher throughput in FP8 mode. WaveLink, a serverless system, runs 35% faster end to end at a cost comparable to incompatible systems, while MorphServe, a dynamic serving framework built on runtime layer swapping and KV cache resizing, reduces average SLO (service-level objective) violations by 92.45% and improves P95 time-to-first-token (TTFT) latency by 2.2x-3.9x relative to full-precision serving.
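To make the carbon-aware caching idea concrete, the sketch below shows one way a serving system might size its KV cache by trading the embodied carbon of extra memory against the operational carbon of recomputing evicted entries. This is not EmbAdvisor's actual algorithm or interface; all names, constants, and the cost model are assumptions for illustration.

```python
"""Hypothetical sketch of carbon-aware KV-cache sizing.

Illustrates the general idea only: trade the embodied carbon of
holding more cache memory against the operational carbon of
recomputing evicted KV entries. Constants and the cost model are
assumed, not taken from EmbAdvisor.
"""

from dataclasses import dataclass


@dataclass
class CarbonModel:
    # Assumed constants; a real deployment would calibrate these.
    embodied_g_per_gb_hour: float = 0.8   # amortized manufacturing carbon of memory
    recompute_g_per_gb: float = 0.05      # grams CO2 per GB of KV state recomputed
    grid_intensity: float = 1.0           # scales operational carbon with the grid mix


def choose_cache_size_gb(model: CarbonModel,
                         candidate_sizes_gb: list[float],
                         miss_gb_per_hour: dict[float, float]) -> float:
    """Pick the cache size minimizing total (embodied + operational) carbon.

    `miss_gb_per_hour` maps each candidate size to the predicted volume of
    KV state recomputed per hour at that size (e.g., from a measured
    miss-rate curve).
    """
    def total_carbon(size: float) -> float:
        embodied = size * model.embodied_g_per_gb_hour
        operational = (miss_gb_per_hour[size]
                       * model.recompute_g_per_gb
                       * model.grid_intensity)
        return embodied + operational

    return min(candidate_sizes_gb, key=total_carbon)


if __name__ == "__main__":
    model = CarbonModel(grid_intensity=1.3)  # e.g., a carbon-heavy grid hour
    sizes = [8.0, 16.0, 32.0]
    misses = {8.0: 400.0, 16.0: 120.0, 32.0: 30.0}  # toy miss curve
    print(choose_cache_size_gb(model, sizes, misses))
```

Under this toy model, a dirtier grid (higher `grid_intensity`) pushes the optimum toward a larger cache, since recomputation becomes relatively more expensive in carbon terms.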
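Similarly, the dual-precision idea admits a simple illustration. FP8 E5M2 is, bit for bit, a truncation of FP16 (same sign and 5-bit exponent, mantissa cut to 2 bits), so the high byte of each FP16 weight is already a valid FP8 value and a server can choose precision per request without storing two copies. The sketch below shows that mechanism; it is one plausible realization, not necessarily NestedFP's actual scheme.

```python
"""Hypothetical sketch of dual-precision weights nested in one FP16 buffer.

Not NestedFP's actual mechanism. It relies only on the fact that FP8
E5M2 is a bit-wise truncation of FP16, so both precisions can be served
from a single 16-bit weight buffer.
"""

import numpy as np


def fp8_view(weights_fp16: np.ndarray) -> np.ndarray:
    """Extract the E5M2 (FP8) high bytes from an FP16 weight buffer."""
    assert weights_fp16.dtype == np.float16
    as_u16 = weights_fp16.view(np.uint16)
    return (as_u16 >> 8).astype(np.uint8)  # top 8 bits: sign, 5 exp, 2 mantissa


def dequantize_fp8(e5m2_bytes: np.ndarray) -> np.ndarray:
    """Widen E5M2 bytes back to FP16 by zero-filling the low mantissa bits."""
    return (e5m2_bytes.astype(np.uint16) << 8).view(np.float16)


def serve(weights_fp16: np.ndarray, x: np.ndarray, high_load: bool) -> np.ndarray:
    """Toy matvec that drops to FP8-derived weights when the server is busy."""
    if high_load:
        w = dequantize_fp8(fp8_view(weights_fp16)).astype(np.float32)
    else:
        w = weights_fp16.astype(np.float32)
    return w @ x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((4, 8)).astype(np.float16)
    x = rng.standard_normal(8).astype(np.float32)
    print(serve(w, x, high_load=True))   # coarser FP8-derived weights
    print(serve(w, x, high_load=False))  # full FP16 weights
```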

Sources

EmbAdvisor: Adaptive Cache Management for Sustainable LLM Serving

Melding the Serverless Control Plane with the Conventional Cluster Manager for Speed and Compatibility

Efficient and Workload-Aware LLM Serving via Runtime Layer Swapping and KV Cache Resizing

NestedFP: High-Performance, Memory-Efficient Dual-Precision Floating Point Support for LLMs

KVCache Cache in the Wild: Characterizing and Optimizing KVCache Cache at a Large Cloud Provider

Inference economics of language models

Kinetics: Rethinking Test-Time Scaling Laws
