Advancements in Language Model Training and Data Curation

The field of language model training is moving toward a more nuanced understanding of how data quality and curation drive model performance. Recent studies highlight the importance of allocating data strategically across the entire training pipeline, with a focus on front-loading reasoning data into pretraining and on using high-quality data to establish durable foundations for later fine-tuning. The development of new scaling laws and metrics, such as the error-entropy scaling law and Spectral Alignment, is providing more accurate descriptions of model behavior and enabling earlier detection of training divergence. Noteworthy papers include:

  • Front-Loading Reasoning: The Synergy between Pretraining and Post-Training Data, which establishes the critical role of front-loading reasoning data into pretraining.
  • Scaling Laws Revisited: Modeling the Role of Data Quality in Language Model Pretraining, which introduces a quality-aware scaling law to predict loss as a joint function of model size, data volume, and data quality.
  • What Scales in Cross-Entropy Scaling Law?, which decomposes cross-entropy into three parts and finds that only error-entropy follows a robust power-law scaling.
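The idea behind a quality-aware scaling law can be sketched as follows. The functional form below is an illustrative assumption, not the one from the paper: it extends a Chinchilla-style loss curve by discounting the token budget with a quality factor `q`, and the coefficient values are arbitrary placeholders chosen only to make the example runnable.

```python
import math


def predicted_loss(n_params: float, n_tokens: float, quality: float,
                   e: float = 1.69, a: float = 406.4, b: float = 410.7,
                   alpha: float = 0.34, beta: float = 0.28) -> float:
    """Illustrative quality-aware scaling law (hypothetical form).

    Loss is modeled as an irreducible term plus power-law penalties for
    limited model size and limited *effective* data, where the raw token
    count is discounted by a quality factor in (0, 1].
    """
    assert 0.0 < quality <= 1.0
    effective_tokens = quality * n_tokens
    return (e
            + a / n_params ** alpha
            + b / effective_tokens ** beta)


# Under this form, halving data quality raises predicted loss just as
# shrinking the raw token budget would:
loss_clean = predicted_loss(n_params=1e9, n_tokens=1e10, quality=1.0)
loss_noisy = predicted_loss(n_params=1e9, n_tokens=1e10, quality=0.5)
```

A model of this shape makes the trade-off in the digest concrete: curation effectively buys extra tokens, so "better data" and "more data" become directly comparable on one loss curve.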

Sources

Market-Based Data Subset Selection -- Principled Aggregation of Multi-Criteria Example Utility

Front-Loading Reasoning: The Synergy between Pretraining and Post-Training Data

Scaling Laws Revisited: Modeling the Role of Data Quality in Language Model Pretraining

What Scales in Cross-Entropy Scaling Law?

Exploring Instruction Data Quality for Explainable Image Quality Assessment

Spectral Alignment as Predictor of Loss Explosion in Neural Network Training

From Acceleration to Saturation: Scaling Behavior of Bootstrapped Language Model Pretraining

Mid-Training of Large Language Models: A Survey

More Data or Better Data? A Critical Analysis of Data Selection and Synthesis for Mathematical Reasoning
