Sustainable and Efficient Large Language Model Serving

The field of large language model (LLM) serving is moving toward more sustainable and efficient solutions. Researchers are exploring approaches such as adaptive cache management and dynamic precision adaptation, which aim to cut carbon emissions while maintaining or improving serving performance.

Notable papers in this area include EmbAdvisor, an adaptive cache manager that reduces average carbon emissions by 9.5%, and NestedFP, which enables seamless FP8 and FP16 inference from a single 16-bit model representation and delivers up to 1.55x higher throughput in FP8 mode. WaveLink, a serverless system, runs 35% faster end to end at a cost comparable to incompatible systems, while MorphServe, a dynamic serving framework built on runtime layer swapping and KV cache resizing, reduces average SLO (service-level objective) violations by 92.45% and improves P95 time-to-first-token (TTFT) latency by 2.2x-3.9x relative to full-precision serving.
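To make the carbon-aware caching idea concrete, the sketch below shows one way a serving system might size its KV cache by trading the embodied carbon of extra memory against the operational carbon of recomputing evicted entries. This is not EmbAdvisor's actual algorithm or interface; all names, constants, and the cost model are assumptions for illustration.

```python
"""Hypothetical sketch of carbon-aware KV-cache sizing.

Illustrates the general idea only: trade the embodied carbon of
holding more cache memory against the operational carbon of
recomputing evicted KV entries. Constants and the cost model are
assumed, not taken from EmbAdvisor.
"""

from dataclasses import dataclass


@dataclass
class CarbonModel:
    # Assumed constants; a real deployment would calibrate these.
    embodied_g_per_gb_hour: float = 0.8   # amortized manufacturing carbon of memory
    recompute_g_per_gb: float = 0.05      # grams CO2 per GB of KV state recomputed
    grid_intensity: float = 1.0           # scales operational carbon with the grid mix


def choose_cache_size_gb(model: CarbonModel,
                         candidate_sizes_gb: list[float],
                         miss_gb_per_hour: dict[float, float]) -> float:
    """Pick the cache size minimizing total (embodied + operational) carbon.

    `miss_gb_per_hour` maps each candidate size to the predicted volume of
    KV state recomputed per hour at that size (e.g., from a measured
    miss-rate curve).
    """
    def total_carbon(size: float) -> float:
        embodied = size * model.embodied_g_per_gb_hour
        operational = (miss_gb_per_hour[size]
                       * model.recompute_g_per_gb
                       * model.grid_intensity)
        return embodied + operational

    return min(candidate_sizes_gb, key=total_carbon)


if __name__ == "__main__":
    model = CarbonModel(grid_intensity=1.3)  # e.g., a carbon-heavy grid hour
    sizes = [8.0, 16.0, 32.0]
    misses = {8.0: 400.0, 16.0: 120.0, 32.0: 30.0}  # toy miss curve
    print(choose_cache_size_gb(model, sizes, misses))
```

Under this toy model, a dirtier grid (higher `grid_intensity`) pushes the optimum toward a larger cache, since recomputation becomes relatively more expensive in carbon terms.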
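Similarly, the dual-precision idea admits a simple illustration. FP8 E5M2 is, bit for bit, a truncation of FP16 (same sign and 5-bit exponent, mantissa cut to 2 bits), so the high byte of each FP16 weight is already a valid FP8 value and a server can choose precision per request without storing two copies. The sketch below shows that mechanism; it is one plausible realization, not necessarily NestedFP's actual scheme.

```python
"""Hypothetical sketch of dual-precision weights nested in one FP16 buffer.

Not NestedFP's actual mechanism. It relies only on the fact that FP8
E5M2 is a bit-wise truncation of FP16, so both precisions can be served
from a single 16-bit weight buffer.
"""

import numpy as np


def fp8_view(weights_fp16: np.ndarray) -> np.ndarray:
    """Extract the E5M2 (FP8) high bytes from an FP16 weight buffer."""
    assert weights_fp16.dtype == np.float16
    as_u16 = weights_fp16.view(np.uint16)
    return (as_u16 >> 8).astype(np.uint8)  # top 8 bits: sign, 5 exp, 2 mantissa


def dequantize_fp8(e5m2_bytes: np.ndarray) -> np.ndarray:
    """Widen E5M2 bytes back to FP16 by zero-filling the low mantissa bits."""
    return (e5m2_bytes.astype(np.uint16) << 8).view(np.float16)


def serve(weights_fp16: np.ndarray, x: np.ndarray, high_load: bool) -> np.ndarray:
    """Toy matvec that drops to FP8-derived weights when the server is busy."""
    if high_load:
        w = dequantize_fp8(fp8_view(weights_fp16)).astype(np.float32)
    else:
        w = weights_fp16.astype(np.float32)
    return w @ x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((4, 8)).astype(np.float16)
    x = rng.standard_normal(8).astype(np.float32)
    print(serve(w, x, high_load=True))   # coarser FP8-derived weights
    print(serve(w, x, high_load=False))  # full FP16 weights
```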

Sources

EmbAdvisor: Adaptive Cache Management for Sustainable LLM Serving

Melding the Serverless Control Plane with the Conventional Cluster Manager for Speed and Compatibility

Efficient and Workload-Aware LLM Serving via Runtime Layer Swapping and KV Cache Resizing

NestedFP: High-Performance, Memory-Efficient Dual-Precision Floating Point Support for LLMs

KVCache Cache in the Wild: Characterizing and Optimizing KVCache Cache at a Large Cloud Provider

Inference economics of language models

Kinetics: Rethinking Test-Time Scaling Laws
