The field of large language models (LLMs) is advancing rapidly, with much of the current effort aimed at improving the efficiency of distributed training and inference. Recent work has centered on optimizing communication patterns, reducing congestion and dilation, and increasing hardware utilization. Notable directions include photonic collective communication libraries, macro-to-micro flow transformation, and topology-aware communication alignment. These innovations have delivered significant speedups in end-to-end training throughput and improved performance across a variety of workloads. Researchers have also proposed novel scheduling systems, job atomization techniques, and pipeline-based approaches to make RL training and LLM inference more efficient.
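Several of the pipeline-based RL approaches mentioned above share a simple underlying pattern: rollout generation (LLM inference) and policy updates (training) run concurrently rather than in strict alternation, with a staleness bound keeping the data close to on-policy. The Python sketch below illustrates only that generic producer-consumer pattern; it is not the implementation of any of the systems cited here, and `generate_rollout`, `update_policy`, the queue depth, and the `max_staleness` cutoff are all placeholder assumptions.

```python
# Generic sketch of pipelined RL training: rollouts are produced in the
# background while the trainer consumes them, dropping batches that have
# become too stale relative to the current policy version.
import queue
import random
import threading
import time


def generate_rollout(policy_version: int) -> dict:
    """Stand-in for LLM inference producing one rollout batch."""
    time.sleep(0.05)  # pretend inference latency
    return {"policy_version": policy_version, "reward": random.random()}


def update_policy(batch: dict) -> None:
    """Stand-in for one optimizer step on a rollout batch."""
    time.sleep(0.03)  # pretend backward pass + optimizer step


def run_pipelined(num_steps: int = 20, max_staleness: int = 2) -> None:
    rollouts: queue.Queue = queue.Queue(maxsize=4)  # bounded buffer between stages
    policy_version = 0
    stop = threading.Event()

    def producer() -> None:
        # Continuously generate rollouts tagged with the policy version in use.
        while not stop.is_set():
            rollouts.put(generate_rollout(policy_version))

    threading.Thread(target=producer, daemon=True).start()

    for _ in range(num_steps):
        batch = rollouts.get()
        # Skip batches that lag the current policy by more than max_staleness
        # versions, so training stays close to on-policy.
        if policy_version - batch["policy_version"] > max_staleness:
            continue
        update_policy(batch)
        policy_version += 1

    stop.set()


if __name__ == "__main__":
    run_pipelined()
```

In a real system the producer would be a fleet of inference workers and the consumer a distributed trainer, but the overlap-plus-staleness-bound structure is the part the sketch is meant to convey.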
Noteworthy papers include:
- PCCL, which achieves up to 3x speedup over state-of-the-art algorithms on 128 GPUs.
- RLinf, which consistently outperforms state-of-the-art systems with a 1.1x-2.13x speedup in end-to-end training throughput.
- APRIL, which improves rollout throughput by up to 44% and achieves up to 8% higher final accuracy across tasks.
- PipelineRL, which achieves approximately 2x faster learning than conventional RL baselines while keeping training data highly on-policy.
- Gyges, which improves throughput by 1.75x-6.57x over state-of-the-art solutions.
- BurstEngine, which achieves a 1.2x speedup with much lower memory overhead than state-of-the-art baselines when training LLMs on extremely long sequences of over 1M tokens.