Advancements in Efficient Distributed Training and Inference for Large Language Models

The field of large language models (LLMs) is rapidly advancing, with a focus on improving the efficiency of distributed training and inference. Recent developments have centered on optimizing communication patterns, reducing congestion and dilation, and increasing hardware utilization. Notable advancements include photonic collective communication libraries, macro-to-micro flow transformation, and topology-aware communication alignment. These innovations have yielded significant speedups in end-to-end training throughput and improved performance across diverse workloads. In addition, researchers have proposed novel scheduling systems, job atomization techniques, and pipeline-based approaches to improve the efficiency of RL training and LLM inference.

Noteworthy papers include: PCCL, which achieves up to a 3x speedup over state-of-the-art algorithms on 128 GPUs. RLinf, which consistently outperforms state-of-the-art systems, achieving a 1.1x-2.13x speedup in end-to-end training throughput. APRIL, which improves rollout throughput by up to 44% and achieves up to 8% higher final accuracy across tasks. PipelineRL, which achieves roughly 2x faster learning than conventional RL baselines while keeping training data highly on-policy. Gyges, which improves throughput by 1.75x-6.57x compared to state-of-the-art solutions. BurstEngine, which achieves a 1.2x speedup with much lower memory overhead than state-of-the-art baselines when training LLMs on extremely long sequences of over 1M tokens.
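To make the partial-rollout idea behind APRIL concrete, the sketch below simulates a scheduler that, each round, generates up to a fixed token budget per sequence, sends finished rollouts to training, and pauses unfinished long-tail generations to resume later instead of stalling the whole batch on the longest sequence. All names and mechanics here (the function, the step budget, the queue discipline) are illustrative assumptions, not APRIL's actual implementation.

```python
from collections import deque

def partial_rollout_schedule(remaining_tokens, batch_size, step_budget):
    """Toy simulation of active partial rollouts (assumed mechanics, not
    APRIL's real system). Each round generates at most `step_budget` tokens
    per sequence; finished sequences are released for training, unfinished
    ones are re-queued with their remaining token count."""
    pending = deque(enumerate(remaining_tokens))  # (sequence id, tokens left)
    rounds = []  # sequence ids finished in each scheduling round
    while pending:
        batch = [pending.popleft()
                 for _ in range(min(batch_size, len(pending)))]
        finished = []
        for idx, left in batch:
            if left <= step_budget:
                finished.append(idx)  # rollout complete -> usable for training
            else:
                # Long-tail generation: pause and carry over to a later round
                pending.append((idx, left - step_budget))
        rounds.append(finished)
    return rounds
```

For example, with sequences needing 10, 500, and 20 more tokens and a 100-token budget, the two short rollouts finish in the first round and training can proceed on them while the 500-token tail is drained over subsequent rounds.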

Sources

PCCL: Photonic circuit-switched collective communication for distributed ML

RLinf: Flexible and Efficient Large-scale Reinforcement Learning via Macro-to-Micro Flow Transformation

Efficient Pre-Training of LLMs via Topology-Aware Communication Alignment on More Than 9600 GPUs

A Study of Skews, Imbalances, and Pathological Conditions in LLM Inference Deployment on GPU Clusters detectable from DPU

APRIL: Active Partial Rollouts in Reinforcement Learning to tame long-tail generation

Scheduler-Driven Job Atomization

PipelineRL: Faster On-policy Reinforcement Learning for Long Sequence Generation

Gyges: Dynamic Cross-Instance Parallelism Transformation for Efficient LLM Inference

BurstEngine: an Efficient Distributed Framework for Training Transformers on Extremely Long Sequences of over 1M Tokens
