The field of distributed optimization and stochastic gradient descent is moving toward more efficient and scalable methods. Researchers are exploring techniques that reduce communication overhead, improve convergence rates, and adapt to dynamic environments. Notably, communication-efficient and decentralized schemes, such as Local SGD and decentralized data parallel training, are becoming increasingly popular and have been shown to match or even exceed the performance of traditional centralized approaches. There is also growing interest in novel optimizers, such as LiMuon, and in stochastic bilevel optimization methods that can handle large models and non-convex objectives. Overall, the field is advancing toward more robust, efficient, and scalable optimization. Noteworthy papers include:
- Understanding Outer Optimizers in Local SGD: Learning Rates, Momentum, and Acceleration, which provides new insights into the role of outer optimizers in Local SGD and proposes methods for tuning the outer learning rate (a minimal sketch of this outer-optimizer pattern follows the list).
- Scaling Up Data Parallelism in Decentralized Deep Learning, which introduces a benchmarking framework and a decentralized adaptive approach for large-scale DNN training.
- LiMuon: Light and Fast Muon Optimizer for Large Models, which proposes a lighter Muon-style optimizer with reduced memory and sample complexity for training large models.
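
To make the outer-optimizer pattern concrete, here is a minimal sketch of Local SGD with an outer SGD-with-momentum step. It is an illustrative toy, not the method of any paper above: the least-squares objective, the simulated workers, and all hyperparameter names (`inner_lr`, `outer_lr`, `outer_momentum`, `local_steps`) are assumptions chosen for the example.

```python
# Minimal Local SGD sketch: workers take several local SGD steps, the server
# averages their parameter deltas ("pseudo-gradients") and applies an outer
# optimizer (here: SGD with momentum). Toy problem and hyperparameters are
# illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem: minimize ||A x - b||^2, data split across workers.
dim, samples_per_worker, num_workers = 10, 200, 4
A = [rng.normal(size=(samples_per_worker, dim)) for _ in range(num_workers)]
x_true = rng.normal(size=dim)
b = [Ai @ x_true + 0.1 * rng.normal(size=samples_per_worker) for Ai in A]

def local_grad(w, x, batch):
    """Stochastic gradient of worker w's least-squares loss on a minibatch."""
    Ai, bi = A[w][batch], b[w][batch]
    return 2.0 * Ai.T @ (Ai @ x - bi) / len(batch)

x_global = np.zeros(dim)            # server / global parameters
outer_buf = np.zeros(dim)           # outer momentum buffer
inner_lr, outer_lr, outer_momentum = 0.01, 0.7, 0.9
local_steps, rounds, batch_size = 20, 50, 32

for r in range(rounds):
    # Inner phase: each worker runs `local_steps` of plain SGD from the
    # current global parameters (in practice this happens in parallel).
    deltas = []
    for w in range(num_workers):
        x_local = x_global.copy()
        for _ in range(local_steps):
            batch = rng.integers(0, samples_per_worker, size=batch_size)
            x_local -= inner_lr * local_grad(w, x_local, batch)
        deltas.append(x_global - x_local)   # worker's pseudo-gradient

    # Outer phase: average the pseudo-gradients and apply the outer optimizer.
    pseudo_grad = np.mean(deltas, axis=0)
    outer_buf = outer_momentum * outer_buf + pseudo_grad
    x_global -= outer_lr * outer_buf

    if r % 10 == 0:
        loss = np.mean([np.mean((A[w] @ x_global - b[w]) ** 2)
                        for w in range(num_workers)])
        print(f"round {r:3d}  loss {loss:.4f}")
```

The outer learning rate and momentum in this sketch are exactly the knobs the first paper above studies; communication happens only once per round, which is where the reduction in overhead comes from.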