Advancements in Distributed Deep Learning

The field of distributed deep learning is moving toward more efficient and scalable training. Recent work focuses on improving communication efficiency, reducing latency, and increasing throughput in large-scale model training, most notably by overlapping computation with communication, optimizing data loading, and leveraging asynchronous processing to shorten training times. There is also growing interest in frameworks and systems that automatically identify the most efficient parallelism strategy and co-optimize hardware and software resources. Together, these advances can substantially improve the performance and scalability of deep learning training. Noteworthy papers include:

- Pseudo-Asynchronous Local SGD proposes a method that improves the efficiency of data-parallel training by reducing communication frequency (a minimal sketch of this idea follows the list).
- The Big Send-off introduces PCCL, a communication library that achieves substantial performance improvements over existing libraries for distributed deep learning workloads.
- Adaptra presents a straggler-resilient training system that adapts the pipeline schedule to absorb communication delays.
- AGILE proposes a lightweight and efficient asynchronous library for GPU-SSD integration, achieving up to 1.88x improvement on workloads with different computation-to-communication ratios.
- Triton-distributed extends the Triton compiler to support native overlapping optimizations for distributed AI workloads.
- FlashOverlap introduces a lightweight design for efficiently overlapping communication and computation, achieving up to 1.65x speedup.
- Scalable and Performant Data Loading presents SPDL, an open-source library that improves data loading performance by up to 74%.
- Galvatron is a distributed system that automatically identifies the most efficient hybrid parallelism strategy for large-scale foundation model training.
- MCMComm proposes a hardware-software co-optimization framework for end-to-end communication in multi-chip modules, achieving significant performance improvements for CNNs and vision transformers.
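To make the reduced-communication idea concrete, here is a minimal PyTorch sketch of the generic local-SGD pattern that work such as Pseudo-Asynchronous Local SGD builds on: each worker takes several purely local optimizer steps, and parameters are averaged across workers only every few iterations. The function names and the `sync_every` interval are illustrative assumptions, not the paper's actual algorithm or API.

```python
# Sketch of the generic local-SGD pattern: local updates with infrequent
# parameter averaging. Illustrative only; not the algorithm from
# "Pseudo-Asynchronous Local SGD". Assumes torch.distributed is already
# initialized (e.g. launched with torchrun).
import torch
import torch.distributed as dist


def average_parameters(model: torch.nn.Module) -> None:
    """All-reduce and average model parameters across all workers."""
    world_size = dist.get_world_size()
    with torch.no_grad():
        for p in model.parameters():
            dist.all_reduce(p.data, op=dist.ReduceOp.SUM)
            p.data.div_(world_size)


def train_local_sgd(model, optimizer, loader, loss_fn, sync_every=8):
    """Take purely local optimizer steps; synchronize every `sync_every` steps."""
    step = 0
    for x, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()                 # local update, no communication
        step += 1
        if step % sync_every == 0:       # infrequent synchronization
            average_parameters(model)
```

Increasing `sync_every` reduces communication volume at the cost of greater divergence between workers' local models before each averaging step.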

Sources

Pseudo-Asynchronous Local SGD: Robust and Efficient Data-Parallel Training

The Big Send-off: High Performance Collectives on GPU-based Supercomputers

Adaptra: Straggler-Resilient Hybrid-Parallel Training with Pipeline Adaptation

AGILE: Lightweight and Efficient Asynchronous GPU-SSD Integration

Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler

FlashOverlap: A Lightweight Design for Efficiently Overlapping Communication and Computation

Scalable and Performant Data Loading

Hetu v2: A General and Scalable Deep Learning System with Hierarchical and Heterogeneous Single Program Multiple Data Annotations

Towards Easy and Realistic Network Infrastructure Testing for Large-scale Machine Learning

Galvatron: An Automatic Distributed System for Efficient Foundation Model Training

MCMComm: Hardware-Software Co-Optimization for End-to-End Communication in Multi-Chip-Modules

GPRat: Gaussian Process Regression with Asynchronous Tasks
