Advancements in Data-Constrained Pretraining and Synthetic Data for LLMs

The field of large language models (LLMs) is moving towards more efficient and effective pretraining methods, particularly in data-constrained settings. Researchers are exploring approaches that make better use of limited pretraining data, such as curriculum learning, data augmentation via text simplification, and predicting training re-evaluation curves. These methods aim to improve representation quality as well as fine-tuning and zero-shot performance. In parallel, there is growing interest in synthetic data techniques that sidestep the limited supply of high-quality natural data. Studies have investigated the benefits and pitfalls of synthetic data, including its scaling laws, and have found that mixing natural and synthetic data can speed up pretraining.

Noteworthy papers in this area include Predicting Training Re-evaluation Curves Enables Effective Data Curriculums for LLMs, which introduces a diagnostic that retrospectively evaluates training batches and predicts where data is best placed during training, and Paired by the Teacher, which presents a two-stage teacher-student pipeline that synthesizes accurate input-output pairs without human labels or parallel data, achieving state-of-the-art results on several benchmarks.
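To make the natural-synthetic mixing idea concrete, below is a minimal sketch of a batch sampler that interleaves documents from the two sources at a fixed rate. The function name mixed_batch_iterator and the synthetic_fraction default are illustrative assumptions for this sketch, not details drawn from the papers above.

```python
import random
from typing import Iterator, List

def mixed_batch_iterator(
    natural_docs: List[str],
    synthetic_docs: List[str],
    synthetic_fraction: float = 0.3,  # illustrative ratio, not from the papers
    batch_size: int = 8,
    seed: int = 0,
) -> Iterator[List[str]]:
    """Yield pretraining batches that mix natural and synthetic documents.

    Each slot in a batch is drawn from the synthetic pool with probability
    `synthetic_fraction`, otherwise from the natural pool, so the expected
    synthetic share per batch equals that fraction.
    """
    rng = random.Random(seed)
    while True:
        yield [
            rng.choice(
                synthetic_docs if rng.random() < synthetic_fraction
                else natural_docs
            )
            for _ in range(batch_size)
        ]
```

In practice the mixing ratio would itself be a tuning target; the scaling-law study listed below examines how such choices interact with pretraining behavior.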

Sources

Beyond Repetition: Text Simplification and Curriculum Learning for Data-Constrained Pretraining

Paired by the Teacher: Turning Unpaired Data into High-Fidelity Pairs for Low-Resource Text Generation

Predicting Training Re-evaluation Curves Enables Effective Data Curriculums for LLMs

MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive-First Text Sources

The Data-Quality Illusion: Rethinking Classifier-Based Quality Filtering for LLM Pretraining

RealClass: A Framework for Classroom Speech Simulation with Public Datasets and Game Engines

Demystifying Synthetic Data in LLM Pre-training: A Systematic Study of Scaling Laws, Benefits, and Pitfalls
