Advancements in Language Model Training and Data Curation

The field of language model training is moving toward a more nuanced understanding of how data quality and curation drive model performance. Recent studies highlight the importance of allocating data strategically across the entire training pipeline, with a focus on front-loading reasoning data into pretraining and on using high-quality data to establish durable foundations for later fine-tuning. The development of new scaling laws and metrics, such as the error-entropy scaling law and Spectral Alignment, is providing more accurate descriptions of model behavior and enabling earlier detection of training divergence. Noteworthy papers include:

  • Front-Loading Reasoning: The Synergy between Pretraining and Post-Training Data, which establishes the critical role of front-loading reasoning data into pretraining.
  • Scaling Laws Revisited: Modeling the Role of Data Quality in Language Model Pretraining, which introduces a quality-aware scaling law to predict loss as a joint function of model size, data volume, and data quality.
  • What Scales in Cross-Entropy Scaling Law?, which decomposes cross-entropy into three parts and finds that only error-entropy follows a robust power-law scaling.
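The idea behind a quality-aware scaling law can be sketched as follows. The functional form below is an illustrative assumption, not the one from the paper: it extends a Chinchilla-style loss curve by discounting the token budget with a quality factor `q`, and the coefficient values are arbitrary placeholders chosen only to make the example runnable.

```python
import math


def predicted_loss(n_params: float, n_tokens: float, quality: float,
                   e: float = 1.69, a: float = 406.4, b: float = 410.7,
                   alpha: float = 0.34, beta: float = 0.28) -> float:
    """Illustrative quality-aware scaling law (hypothetical form).

    Loss is modeled as an irreducible term plus power-law penalties for
    limited model size and limited *effective* data, where the raw token
    count is discounted by a quality factor in (0, 1].
    """
    assert 0.0 < quality <= 1.0
    effective_tokens = quality * n_tokens
    return (e
            + a / n_params ** alpha
            + b / effective_tokens ** beta)


# Under this form, halving data quality raises predicted loss just as
# shrinking the raw token budget would:
loss_clean = predicted_loss(n_params=1e9, n_tokens=1e10, quality=1.0)
loss_noisy = predicted_loss(n_params=1e9, n_tokens=1e10, quality=0.5)
```

A model of this shape makes the trade-off in the digest concrete: curation effectively buys extra tokens, so "better data" and "more data" become directly comparable on one loss curve.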

Sources

Market-Based Data Subset Selection -- Principled Aggregation of Multi-Criteria Example Utility

Front-Loading Reasoning: The Synergy between Pretraining and Post-Training Data

Scaling Laws Revisited: Modeling the Role of Data Quality in Language Model Pretraining

What Scales in Cross-Entropy Scaling Law?

Exploring Instruction Data Quality for Explainable Image Quality Assessment

Spectral Alignment as Predictor of Loss Explosion in Neural Network Training

From Acceleration to Saturation: Scaling Behavior of Bootstrapped Language Model Pretraining

Mid-Training of Large Language Models: A Survey

More Data or Better Data? A Critical Analysis of Data Selection and Synthesis for Mathematical Reasoning
