Advancements in Data-Constrained Pretraining and Synthetic Data for LLMs

The field of large language models (LLMs) is moving towards more efficient and effective pretraining methods, particularly in data-constrained settings. Researchers are exploring approaches that make better use of limited pretraining data, such as curriculum learning, data augmentation via text simplification, and predicting training re-evaluation curves. These methods aim to improve representation quality as well as fine-tuning and zero-shot performance. In parallel, there is growing interest in synthetic data techniques that sidestep the limited supply of high-quality natural data. Studies have investigated the benefits and pitfalls of synthetic data, including its scaling laws, and have found that mixing natural and synthetic data can speed up pretraining.

Noteworthy papers in this area include Predicting Training Re-evaluation Curves Enables Effective Data Curriculums for LLMs, which introduces a diagnostic that retrospectively evaluates training batches and predicts where data is best placed during training, and Paired by the Teacher, which presents a two-stage teacher-student pipeline that synthesizes accurate input-output pairs without human labels or parallel data, achieving state-of-the-art results on several benchmarks.
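To make the natural-synthetic mixing idea concrete, below is a minimal sketch of a batch sampler that interleaves documents from the two sources at a fixed rate. The function name mixed_batch_iterator and the synthetic_fraction default are illustrative assumptions for this sketch, not details drawn from the papers above.

```python
import random
from typing import Iterator, List

def mixed_batch_iterator(
    natural_docs: List[str],
    synthetic_docs: List[str],
    synthetic_fraction: float = 0.3,  # illustrative ratio, not from the papers
    batch_size: int = 8,
    seed: int = 0,
) -> Iterator[List[str]]:
    """Yield pretraining batches that mix natural and synthetic documents.

    Each slot in a batch is drawn from the synthetic pool with probability
    `synthetic_fraction`, otherwise from the natural pool, so the expected
    synthetic share per batch equals that fraction.
    """
    rng = random.Random(seed)
    while True:
        yield [
            rng.choice(
                synthetic_docs if rng.random() < synthetic_fraction
                else natural_docs
            )
            for _ in range(batch_size)
        ]
```

In practice the mixing ratio would itself be a tuning target; the scaling-law study listed below examines how such choices interact with pretraining behavior.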

Sources

Beyond Repetition: Text Simplification and Curriculum Learning for Data-Constrained Pretraining

Paired by the Teacher: Turning Unpaired Data into High-Fidelity Pairs for Low-Resource Text Generation

Predicting Training Re-evaluation Curves Enables Effective Data Curriculums for LLMs

MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive-First Text Sources

The Data-Quality Illusion: Rethinking Classifier-Based Quality Filtering for LLM Pretraining

RealClass: A Framework for Classroom Speech Simulation with Public Datasets and Game Engines

Demystifying Synthetic Data in LLM Pre-training: A Systematic Study of Scaling Laws, Benefits, and Pitfalls
