The field of AI is seeing rapid progress in inference disaggregation and load balancing, driven by the need for higher throughput and better interactivity in large-scale deployments. Researchers are exploring new architectures and techniques to optimize the performance of large language models and other AI workloads. One key direction is temporally-disaggregated pipeline parallelism, which aims to eliminate pipeline bubbles and thereby raise throughput. Another is congestion-aware path selection for load balancing in AI clusters, which can substantially reduce flow completion times; both directions are sketched after the list below. Noteworthy papers include:
- BestServe, which presents a framework for ranking serving strategies by estimating their goodput under various operating scenarios (a goodput-estimation sketch appears after the list), and
- TD-Pipe, which proposes a temporally-disaggregated pipeline parallelism architecture for high-throughput LLM inference and reports substantial throughput gains over existing approaches; the scheduler sketch directly below illustrates the idea.
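
Temporal disaggregation is easiest to see at the scheduler level. The sketch below alternates an inference engine between prefill-only and decode-only phases, so that each pipeline iteration processes a homogeneous batch; mixing compute-heavy prefill steps with lightweight decode steps in the same batch is a common source of pipeline bubbles. The phase-switch heuristic, the `refill_watermark` threshold, and all class and parameter names here are illustrative assumptions, not TD-Pipe's actual policy.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int
    max_new_tokens: int
    generated: int = 0


class TemporallyDisaggregatedScheduler:
    """Alternates between prefill-only and decode-only phases so every
    pipeline iteration runs a homogeneous batch (illustrative sketch)."""

    def __init__(self, max_running=32, refill_watermark=8):
        self.waiting = deque()      # requests not yet prefilled
        self.running = []           # requests currently decoding
        self.max_running = max_running
        self.refill_watermark = refill_watermark  # assumed switch threshold
        self.phase = "prefill"

    def submit(self, req):
        self.waiting.append(req)

    def step(self):
        """One engine iteration: a prefill-only or decode-only batch."""
        # Switch phases only at batch boundaries, keeping iterations uniform.
        if (self.phase == "decode" and self.waiting
                and len(self.running) < self.refill_watermark):
            self.phase = "prefill"
        if self.phase == "prefill" and (
                len(self.running) >= self.max_running or not self.waiting):
            self.phase = "decode"

        if self.phase == "prefill":
            # Admit waiting requests in bulk until the running set is full.
            while self.waiting and len(self.running) < self.max_running:
                self.running.append(self.waiting.popleft())
            return "prefill", len(self.running)
        # Decode one token for every running request, retiring finished ones.
        for r in self.running:
            r.generated += 1
        self.running = [r for r in self.running if r.generated < r.max_new_tokens]
        return "decode", len(self.running)
```

Batching phase switches this way trades some time-to-first-token (prefills queue up during decode phases) for steady pipeline utilization, which matches the high-throughput framing of the summary.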
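For the load-balancing direction, a minimal sketch of congestion-aware path selection follows: rather than hashing a flow onto a random equal-cost path, the selector places it on the candidate path whose most-loaded link has the most spare capacity. The greedy bottleneck heuristic, the link names, and the utilization numbers are illustrative assumptions, not the surveyed paper's algorithm.

```python
def bottleneck_load(path, link_load):
    """Utilization of the most congested link on the path."""
    return max(link_load[link] for link in path)


def select_path(candidate_paths, link_load, flow_demand):
    """Greedily place the flow on the least-congested candidate path."""
    best = min(candidate_paths, key=lambda p: bottleneck_load(p, link_load))
    for link in best:
        link_load[link] += flow_demand  # account for the newly placed flow
    return best


# Example: two equal-cost paths between ToR switches in a 2-tier Clos fabric.
link_load = {"tor1-spine1": 0.7, "spine1-tor2": 0.2,
             "tor1-spine2": 0.1, "spine2-tor2": 0.3}
paths = [("tor1-spine1", "spine1-tor2"),   # via spine1 (hot)
         ("tor1-spine2", "spine2-tor2")]   # via spine2 (cool)

chosen = select_path(paths, link_load, flow_demand=0.2)
print("chosen path:", chosen)  # picks the spine2 path: bottleneck 0.3 vs 0.7
```

Because a flow's completion time is set by its bottleneck link, steering new flows away from hot links is what shortens flow completion times in this class of schemes.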
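Finally, BestServe's core idea, ranking serving strategies by estimated goodput (the rate of requests that meet their latency SLOs), can be sketched with a simple analytic model. The M/D/1-style latency approximation, the strategy parameters, and the numbers below are assumptions for illustration, not BestServe's actual estimator.

```python
from dataclasses import dataclass


@dataclass
class Strategy:
    name: str
    capacity_rps: float     # max sustainable request rate
    service_time_s: float   # mean per-request service time


def estimate_goodput(strategy, arrival_rps, slo_s):
    """Requests/s served within the SLO under a crude M/D/1 approximation."""
    served = min(arrival_rps, strategy.capacity_rps)
    rho = served / strategy.capacity_rps
    if rho >= 1.0:
        return 0.0                      # saturated: queueing delay blows up
    # M/D/1 mean waiting time plus service time as a latency proxy (assumed).
    wait = (rho * strategy.service_time_s) / (2 * (1 - rho))
    latency = strategy.service_time_s + wait
    return served if latency <= slo_s else 0.0


def rank_strategies(strategies, arrival_rps, slo_s):
    """Order candidate serving strategies by estimated goodput, best first."""
    return sorted(strategies,
                  key=lambda s: estimate_goodput(s, arrival_rps, slo_s),
                  reverse=True)


candidates = [Strategy("colocated", capacity_rps=90, service_time_s=0.40),
              Strategy("disaggregated", capacity_rps=120, service_time_s=0.25)]
for s in rank_strategies(candidates, arrival_rps=80, slo_s=0.5):
    print(s.name, round(estimate_goodput(s, arrival_rps=80, slo_s=0.5), 1))
```

Sweeping the arrival rate and SLO in such a model yields a per-scenario ranking, which is the kind of output the summary attributes to BestServe.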