Advances in Machine Learning Optimization

The field of machine learning is seeing significant advances in optimization techniques, leading to improved training efficiency and performance. Researchers are exploring new methods for setting learning rates, such as identifying a proportionality between learning rates and dataset sizes and designing cumulative-learning constants. Optimal control theory is also being applied to transformer architectures, yielding gains in generalization, robustness, and efficiency. Studies of scaling laws for hyperparameters such as weight decay and batch size are providing valuable insights into large language model pre-training, and novel optimization algorithms like AdamS are being developed as alternatives to traditional optimizers. Noteworthy papers in this area include:

  • Optimal Control for Transformer Architectures, which applies optimal control theory to enhance the generalization, robustness, and efficiency of transformers.
  • AdamS, a simple yet effective alternative to Adam for large language model pretraining and post-training, offering superior optimization performance and efficiency (see the sketch after this list).
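The AdamS entry is easier to interpret with a concrete update rule in hand. The sketch below is a minimal, hedged illustration of an Adam-style step in which no second-moment buffer is stored and the momentum itself supplies the normalization, as the paper's title suggests; the specific beta2-weighted mix of squared momentum and squared gradient in the denominator, and the name `adams_like_step`, are illustrative assumptions rather than the published algorithm.

```python
import numpy as np

def adams_like_step(param, grad, m, lr=1e-3, beta1=0.9, beta2=0.999,
                    weight_decay=0.0, eps=1e-8):
    """One update of an AdamS-style optimizer (illustrative sketch).

    Assumption: following the paper's title, Adam's second-moment buffer is
    dropped and the momentum supplies the normalizer. The exact normalizer
    below (a beta2-weighted mix of squared momentum and squared gradient)
    is an assumption for illustration, not the paper's definition.
    """
    # Decoupled weight decay, as in AdamW.
    if weight_decay:
        param = param * (1.0 - lr * weight_decay)

    # Standard first-moment (momentum) update.
    m = beta1 * m + (1.0 - beta1) * grad

    # Momentum-derived normalizer: no separate second-moment state is kept.
    denom = np.sqrt(beta2 * m**2 + (1.0 - beta2) * grad**2) + eps

    param = param - lr * m / denom
    return param, m


# Toy usage: minimize f(x) = ||x||^2 from a random start.
x = np.random.randn(5)
m = np.zeros_like(x)
for _ in range(200):
    grad = 2.0 * x          # gradient of ||x||^2
    x, m = adams_like_step(x, grad, m, lr=0.05)
print(np.round(x, 4))       # parameters are driven toward zero (to within lr-sized final steps)
```

Because only the momentum vector is stored, an optimizer of this shape keeps one state buffer per parameter instead of Adam's two, which is the kind of memory and efficiency saving the summary alludes to.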

Sources

Tuning Learning Rates with the Cumulative-Learning Constant

Optimal Control for Transformer Architectures: Enhancing Generalization, Robustness and Efficiency

Power Lines: Scaling Laws for Weight Decay and Batch Size in LLM Pre-training

AdamS: Momentum Itself Can Be A Normalizer for LLM Pretraining and Post-training

The Polar Express: Optimal Matrix Sign Methods and Their Application to the Muon Algorithm
