Advancements in Large Language Model Training and Inference

The field of large language model (LLM) training and inference is evolving rapidly, with a focus on improving efficiency, scalability, and reliability. Recent work centers on the challenges of training large models under variable-length workloads, on heterogeneous architectures, and in decentralized environments. Researchers have proposed techniques such as hierarchical sequence partitioning, adaptive pipeline parallelism, and topology-aware sequence parallelism to mitigate load imbalance and communication overhead. There has also been a push toward more efficient and reliable collective communication libraries, unified scheduling systems, and serving platforms for LLMs. Together, these advances yield higher training throughput, lower inference latency, and better resource utilization. Notable papers include Zeppelin, which reports an average 2.80x speedup over state-of-the-art methods, and TASP, which reports up to 3.58x speedup over Ring Attention; AdaPtis, HAPT, Parallax, SlimPack, ElasWave, ICCL, Kant, and TetriServe also make significant contributions to the field.
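To make the load-imbalance problem concrete, the sketch below shows one simple way to spread variable-length sequences across data-parallel ranks so that per-rank token counts stay roughly equal. It is a minimal, generic illustration only, not the actual algorithm of Zeppelin, SlimPack, or any other cited paper; the function name balance_sequences, its arguments, and the sample lengths are hypothetical.

    # Illustrative sketch (hypothetical, not any cited paper's method):
    # greedy token-balanced assignment of variable-length sequences to
    # data-parallel ranks.
    import heapq

    def balance_sequences(seq_lengths, num_ranks):
        """Assign each sequence to the rank with the fewest tokens so far.

        seq_lengths: per-sequence token counts in one global batch.
        num_ranks:   number of data-parallel workers.
        Returns a list of per-rank sequence-index lists.
        """
        # Longest-first greedy packing keeps the per-rank token counts close.
        order = sorted(range(len(seq_lengths)), key=lambda i: -seq_lengths[i])
        heap = [(0, rank) for rank in range(num_ranks)]  # (token_count, rank)
        heapq.heapify(heap)
        assignment = [[] for _ in range(num_ranks)]
        for i in order:
            tokens, rank = heapq.heappop(heap)
            assignment[rank].append(i)
            heapq.heappush(heap, (tokens + seq_lengths[i], rank))
        return assignment

    if __name__ == "__main__":
        lengths = [4096, 512, 2048, 128, 8192, 1024, 256, 3072]
        for rank, idxs in enumerate(balance_sequences(lengths, num_ranks=2)):
            print(f"rank {rank}: {sum(lengths[i] for i in idxs)} tokens -> {idxs}")

Without such balancing, a naive round-robin split of this batch can leave one rank with several times more tokens than another, which stalls synchronous data-parallel steps; the cited systems tackle the same imbalance with considerably more sophisticated partitioning and packing strategies.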

Sources

Zeppelin: Balancing Variable-length Workloads in Data Parallel Large Model Training

AdaPtis: Reducing Pipeline Bubbles with Adaptive Pipeline Parallelism on Heterogeneous Models

HAPT: Heterogeneity-Aware Automated Parallel Training on Heterogeneous Clusters

Parallax: Efficient LLM Inference Service over Decentralized Environment

SlimPack: Fine-Grained Asymmetric Packing for Balanced and Efficient Variable-Length LLM Training

TASP: Topology-aware Sequence Parallelism

Container Orchestration Patterns for Optimizing Resource Use

ElasWave: An Elastic-Native System for Scalable Hybrid-Parallel Training

An Efficient, Reliable and Observable Collective Communication Library in Large-scale GPU Training Clusters

Kant: An Efficient Unified Scheduling System for Large-Scale AI Clusters

TetriServe: Efficient DiT Serving for Heterogeneous Image Generation
