Advancements in Large Language Models and Pretraining Methods

Research on large language models and their pretraining methods continues to advance rapidly. Current work targets both performance and training efficiency through synthetic data generation, curriculum learning, and dynamic vocabulary selection, with downstream benefits for applications such as text generation, language understanding, and ordinal classification. Noteworthy papers include BeyondWeb, a synthetic data generation framework that outperforms state-of-the-art synthetic pretraining datasets, and Nemotron-CC-Math, a high-quality mathematical corpus built from Common Crawl with a novel extraction pipeline. Also notable are VocabTailor, a dynamic vocabulary selection framework for small language models, and Influence-driven Curriculum Learning, which uses training-data influence as a difficulty metric for ordering pretraining examples; a minimal sketch of these last two ideas follows below.
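To make two of these techniques concrete, here is a minimal, hypothetical Python sketch of (a) dynamic vocabulary selection, which shrinks a small model's embedding table to the tokens a downstream task actually uses, and (b) influence-driven curriculum ordering, which sorts training examples from easy to hard by a precomputed influence score. The function names, the top_k cutoff, and the use of raw token frequency as the selection criterion are illustrative assumptions, not the exact methods of VocabTailor or the curriculum-learning paper.

```python
from collections import Counter
import numpy as np

def select_task_vocabulary(task_texts, tokenizer, full_embeddings, top_k=8000):
    """Illustrative dynamic vocabulary selection (assumption: frequency-based).

    Keeps only the top_k token ids observed in the downstream task data and
    slices the embedding matrix to those rows, shrinking the embedding and
    output layers of a small model. Assumes a HuggingFace-style tokenizer
    whose encode() returns a list of token ids.
    """
    counts = Counter()
    for text in task_texts:
        counts.update(tokenizer.encode(text))
    kept_ids = [tok_id for tok_id, _ in counts.most_common(top_k)]
    remap = {orig: new for new, orig in enumerate(kept_ids)}  # old id -> compact id
    reduced_embeddings = full_embeddings[kept_ids]            # shape: (top_k, hidden_dim)
    return reduced_embeddings, remap

def influence_curriculum(examples, influence_scores):
    """Illustrative curriculum ordering: treat a precomputed per-example
    influence score as a difficulty proxy and present examples easy-to-hard."""
    order = np.argsort(influence_scores)  # low influence first (assumed "easy")
    return [examples[i] for i in order]
```

In practice, the selection criterion for the reduced vocabulary and the direction of the curriculum (easy-to-hard versus hard-to-easy) are exactly the design questions these papers investigate; the sketch only fixes one plausible choice for each.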

Sources

BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining

Learning In-context $\pmb{n}$-grams with Transformers: Sub-$\pmb{n}$-grams Are Near-stationary Points

CLoE: Curriculum Learning on Endoscopic Images for Robust MES Classification

GLASS: Test-Time Acceleration for LLMs via Global-Local Neural Importance Aggregation

Nemotron-CC-Math: A 133 Billion-Token-Scale High Quality Math Pretraining Dataset

VocabTailor: Dynamic Vocabulary Selection for Downstream Tasks in Small Language Models

Exploiting Vocabulary Frequency Imbalance in Language Model Pre-training

Influence-driven Curriculum Learning for Pre-training on Limited Data
