Advances in Data Selection and Curation for Large Language Models

The field of natural language processing is moving toward more efficient and effective data selection and curation methods for large language models. Recent research highlights the importance of accounting for the computational budget and for the dynamic nature of sample influence during optimization. Innovative approaches, such as layer-aware online influence estimators and budget-aware data selection methods, have shown promising results, improving accuracy while reducing time and memory costs. There is also a growing focus on diversity-driven selection methods that prioritize both quality and diversity, ensuring broad coverage and distributional heterogeneity.

Noteworthy papers in this area include: "Layer-Aware Influence for Online Data Valuation Estimation", which develops a layer-aware online estimator for efficient data valuation; "Utility-Diversity Aware Online Batch Selection for LLM Supervised Fine-tuning", which proposes a framework for efficient online batch selection that captures both data utility and intra-sample diversity; and "Learning from the Best, Differently: A Diversity-Driven Rethinking on Data Selection", which presents an orthogonal diversity-aware selection algorithm that preserves both quality and diversity during data selection.
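The quality-versus-diversity trade-off behind these methods can be illustrated with a minimal greedy sketch, in the style of maximal-marginal-relevance selection: each candidate is scored by its quality (utility) minus its redundancy with samples already chosen. This is an illustrative assumption, not any of the cited papers' actual algorithms; the function names, the mixing weight `lam`, and the toy embeddings are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def select_batch(embeddings, quality, k, lam=0.5):
    """Greedy quality-plus-diversity selection (MMR-style sketch).

    embeddings: list of sample embedding vectors
    quality:    per-sample utility scores (e.g. influence estimates)
    k:          batch size to select
    lam:        trade-off between utility (lam) and diversity (1 - lam)
    """
    selected = []
    remaining = list(range(len(embeddings)))
    while remaining and len(selected) < k:
        def score(i):
            # Redundancy: maximum similarity to anything already selected.
            redundancy = max(
                (cosine(embeddings[i], embeddings[j]) for j in selected),
                default=0.0,
            )
            return lam * quality[i] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With two near-duplicate high-quality samples and one distinct lower-quality sample, the sketch keeps only one of the duplicates and fills the batch with the distinct sample, showing how a diversity penalty prevents a batch from collapsing onto redundant high-utility data.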

Sources

Layer-Aware Influence for Online Data Valuation Estimation

Contrasting the Hyperparameter Tuning Impact Across Software Defect Prediction Scenarios

Computational Budget Should Be Considered in Data Selection

Utility-Diversity Aware Online Batch Selection for LLM Supervised Fine-tuning

Learning from the Best, Differently: A Diversity-Driven Rethinking on Data Selection

What Does It Take to Build a Performant Selective Classifier?

An Empirical Study of Sample Selection Strategies for Large Language Model Repair
