Advances in Data Selection and Curation for Large Language Models

The field of natural language processing is moving toward more efficient and effective data selection and curation methods for large language models. Recent research highlights the importance of accounting for the computational budget and for the dynamic nature of sample influence during optimization. Innovative approaches, such as layer-aware online influence estimators and budget-aware data selection methods, have shown promising results, improving accuracy while reducing time and memory costs. There is also a growing focus on diversity-driven selection methods that prioritize both quality and diversity, ensuring broad coverage and distributional heterogeneity.

Noteworthy papers in this area include: "Layer-Aware Influence for Online Data Valuation Estimation", which develops a layer-aware online estimator for efficient data valuation; "Utility-Diversity Aware Online Batch Selection for LLM Supervised Fine-tuning", which proposes a framework for efficient online batch selection that captures both data utility and intra-sample diversity; and "Learning from the Best, Differently: A Diversity-Driven Rethinking on Data Selection", which presents an orthogonal diversity-aware selection algorithm that preserves both quality and diversity during data selection.
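The quality-versus-diversity trade-off behind these methods can be illustrated with a minimal greedy sketch, in the style of maximal-marginal-relevance selection: each candidate is scored by its quality (utility) minus its redundancy with samples already chosen. This is an illustrative assumption, not any of the cited papers' actual algorithms; the function names, the mixing weight `lam`, and the toy embeddings are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def select_batch(embeddings, quality, k, lam=0.5):
    """Greedy quality-plus-diversity selection (MMR-style sketch).

    embeddings: list of sample embedding vectors
    quality:    per-sample utility scores (e.g. influence estimates)
    k:          batch size to select
    lam:        trade-off between utility (lam) and diversity (1 - lam)
    """
    selected = []
    remaining = list(range(len(embeddings)))
    while remaining and len(selected) < k:
        def score(i):
            # Redundancy: maximum similarity to anything already selected.
            redundancy = max(
                (cosine(embeddings[i], embeddings[j]) for j in selected),
                default=0.0,
            )
            return lam * quality[i] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With two near-duplicate high-quality samples and one distinct lower-quality sample, the sketch keeps only one of the duplicates and fills the batch with the distinct sample, showing how a diversity penalty prevents a batch from collapsing onto redundant high-utility data.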

Sources

Layer-Aware Influence for Online Data Valuation Estimation

Contrasting the Hyperparameter Tuning Impact Across Software Defect Prediction Scenarios

Computational Budget Should Be Considered in Data Selection

Utility-Diversity Aware Online Batch Selection for LLM Supervised Fine-tuning

Learning from the Best, Differently: A Diversity-Driven Rethinking on Data Selection

What Does It Take to Build a Performant Selective Classifier?

An Empirical Study of Sample Selection Strategies for Large Language Model Repair
