The field of large language model pretraining is moving toward more efficient and scalable methods. Researchers are revisiting the optimizer itself through systematic benchmarks, examining training paradigms such as distillation, and exploring architectural variants such as layer-wise scaling. Surrogate benchmarks and systematic evaluations are becoming increasingly important for comparing and selecting optimizers, and more efficient training pipelines and better GPU utilization remain crucial for reducing training time and cost. Noteworthy papers in this area include:

- Benchmarking Optimizers for Large Language Model Pretraining, which provides a comprehensive evaluation of recent optimization techniques.
- Distilled Pretraining: A modern lens of Data, In-Context Learning and Test-Time Scaling, which examines how distillation affects test-time scaling and in-context learning (see the distillation-loss sketch after this list).
- Fantastic Pretraining Optimizers and Where to Find Them, which conducts a systematic study of ten deep learning optimizers and finds that matrix-based optimizers are among the fastest (see the orthogonalization sketch below).
- Surrogate Benchmarks for Model Merging Optimization, which develops surrogate benchmarks for optimizing merging hyperparameters (see the merging sketch below).
- Estudio de la eficiencia en la escalabilidad de GPUs para el entrenamiento de Inteligencia Artificial (a study of the efficiency of GPU scaling for AI training), which analyzes how efficiently GPUs are used when training large-scale deep learning models (see the scaling-efficiency sketch below).
- Scaling Performance of Large Language Model Pretraining, which aims to demystify the large language model pretraining pipeline and provides practical recommendations for tuning training performance.
- Crown, Frame, Reverse: Layer-Wise Scaling Variants for LLM Pre-Training, which introduces new layer-wise scaling variants for pre-training and presents a systematic ablation study (see the layer-wise width sketch below).
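The distillation setting studied in Distilled Pretraining can be made concrete with the standard teacher-student objective. Below is a minimal sketch of a KL-based distillation loss in PyTorch; the temperature, tensor shapes, and function name are illustrative assumptions, not the paper's actual setup.

```python
# Minimal sketch of a KL-based distillation loss (illustrative, not the paper's setup).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """KL divergence between temperature-softened teacher and student distributions."""
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2

# Example: batch of 4 sequences, 8 positions, vocabulary of 100 (toy shapes).
student = torch.randn(4, 8, 100)
teacher = torch.randn(4, 8, 100)
loss = distillation_loss(student.flatten(0, 1), teacher.flatten(0, 1))
```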
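"Matrix-based" refers to optimizers that treat a weight matrix as a matrix rather than as a flat vector of independent scalars. As a hedged illustration of that family, the sketch below orthogonalizes a momentum matrix with a Newton-Schulz iteration, in the spirit of optimizers such as Muon; the coefficients, step count, and learning rate are assumptions, not values taken from the study.

```python
# Illustrative matrix-based update: orthogonalize a 2-D momentum buffer before
# applying it. Coefficients follow the quintic Newton-Schulz iteration popularized
# by Muon; all values here are assumptions for the sketch.
import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately map g to the nearest semi-orthogonal matrix."""
    a, b, c = 3.4445, -4.7750, 2.0315   # quintic iteration coefficients (assumed)
    x = g / (g.norm() + 1e-7)           # normalize so the iteration converges
    transposed = x.shape[0] > x.shape[1]
    if transposed:                      # operate on the smaller Gram matrix
        x = x.T
    for _ in range(steps):
        xxt = x @ x.T
        x = a * x + (b * xxt + c * xxt @ xxt) @ x
    return x.T if transposed else x

# Example: one orthogonalized-momentum step for a single weight matrix.
w = torch.randn(512, 256)
momentum = torch.randn_like(w)
w -= 0.02 * newton_schulz_orthogonalize(momentum)
```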
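For context on what the merging hyperparameters are, the sketch below shows the simplest form of model merging: a convex combination of two checkpoints' parameters. The coefficient lam is exactly the kind of knob a surrogate benchmark would predict performance for; the function and toy checkpoints are hypothetical.

```python
# Minimal sketch of linear model merging (hypothetical helper, toy checkpoints).
import torch

def merge_state_dicts(sd_a: dict, sd_b: dict, lam: float = 0.5) -> dict:
    """Interpolate matching parameters: lam * A + (1 - lam) * B."""
    return {k: lam * sd_a[k] + (1.0 - lam) * sd_b[k] for k in sd_a}

# Example with toy "checkpoints".
sd_a = {"w": torch.ones(2, 2)}
sd_b = {"w": torch.zeros(2, 2)}
merged = merge_state_dicts(sd_a, sd_b, lam=0.3)  # every entry becomes 0.3
```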
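The kind of GPU scaling analysis performed in the Spanish-language study is usually summarized with two standard quantities, speedup and parallel efficiency relative to a single-GPU baseline. The helper below is an illustrative sketch with made-up throughput numbers, not data from the paper.

```python
# Illustrative scaling-efficiency helper; the sample throughputs are invented.
def scaling_efficiency(throughput: dict[int, float]) -> dict[int, tuple[float, float]]:
    """Map GPU count -> (speedup, efficiency) versus the 1-GPU measurement."""
    base = throughput[1]
    return {n: (t / base, t / (base * n)) for n, t in sorted(throughput.items())}

# Example: tokens/sec measured at 1, 2, 4, and 8 GPUs (hypothetical values).
print(scaling_efficiency({1: 10_000, 2: 19_000, 4: 36_000, 8: 64_000}))
```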
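Layer-wise scaling varies per-layer dimensions across depth instead of using one uniform width. The sketch below illustrates the general idea with a simple linear interpolation of feed-forward widths; the scheme and constants are assumptions and do not correspond to the Crown, Frame, or Reverse variants defined in the paper.

```python
# Hypothetical layer-wise scaling sketch: interpolate per-layer FFN widths across
# depth. The linear schedule and constants are assumptions for illustration only.
def layerwise_widths(num_layers: int, base_width: int,
                     alpha_min: float = 0.5, alpha_max: float = 1.5) -> list[int]:
    """Scale each layer's width from alpha_min to alpha_max of the base width."""
    widths = []
    for i in range(num_layers):
        alpha = alpha_min + (alpha_max - alpha_min) * i / max(num_layers - 1, 1)
        widths.append(int(round(base_width * alpha / 64)) * 64)  # keep multiples of 64
    return widths

print(layerwise_widths(num_layers=12, base_width=2048))
```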