Efficient Scaling of Large Language Models

The field of large language models is moving toward efficient scaling, with a focus on reducing computational cost and improving inference speed. Researchers are exploring a range of methods, including progressive training, lossless parallel tokenization, and iterative layer-wise distillation, all of which aim to preserve the capabilities of large models while substantially reducing their computational requirements. Noteworthy papers in this area include:

Deep Progressive Training proposes a zero/one-layer progressive training method for an optimal tradeoff between computation and loss.

LoPT is a Lossless Parallel Tokenization framework that guarantees output identical to standard sequential tokenization (see the tokenization sketch after this list).

Iterative Layer-wise Distillation for Efficient Compression of Large Language Models develops an improved compression method based on the ShortGPT approach (see the layer-scoring sketch after this list).

Attention and Compression is all you need for Controllably Efficient Language Models proposes the Compress & Attend Transformer (CAT), a conceptually simple architecture that combines dense attention with compression.

A Metamorphic Testing Perspective on Knowledge Distillation for Language Models of Code proposes MetaCompress, a metamorphic testing framework that systematically evaluates the behavioral fidelity of distilled models.

Schedulers for Schedule-free extends the last-iterate convergence theory of schedule-free optimization to arbitrary schedulers.

Information Capacity introduces information capacity, a measure of model efficiency based on text compression performance relative to computational complexity (see the bits-per-byte sketch after this list).

Sentence-Anchored Gist Compression for Long-Context LLMs investigates context compression for large language models (LLMs) using learned compression tokens.

Context-Aware Dynamic Chunking for Streaming Tibetan Speech Recognition proposes a streaming speech recognition framework for Amdo Tibetan, built on a hybrid CTC/Attention architecture with a context-aware dynamic chunking mechanism.
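
The consistency requirement behind lossless parallel tokenization can be seen in a few lines of Python. The toy tokenizer below pre-tokenizes on whitespace and then applies greedy longest-match merges; because whitespace is a guaranteed token boundary for this tokenizer, chunking the input at whitespace and tokenizing the chunks concurrently reproduces the sequential result exactly. This is only a minimal illustration of the "lossless" requirement, not LoPT's actual algorithm; the vocabulary and chunking policy are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

# Invented toy vocabulary: multi-character merges plus single-character fallbacks.
VOCAB = {"the", "qu", "ick", "brown", "fox", "jump", "over", "lazy", "dog",
         "t", "h", "e", "q", "u", "i", "c", "k", "b", "r", "o", "w", "n",
         "f", "x", "j", "m", "p", "s", "v", "l", "a", "z", "y", "d", "g"}

def tokenize_word(word: str) -> list[str]:
    """Greedy longest-match tokenization of a single whitespace-free word."""
    out, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                out.append(word[i:j])
                i = j
                break
        else:  # unknown character: emit it as-is
            out.append(word[i])
            i += 1
    return out

def tokenize(text: str) -> list[str]:
    """Sequential reference: whitespace pre-tokenization, then greedy merges."""
    return [tok for word in text.split() for tok in tokenize_word(word)]

def tokenize_parallel(text: str, num_chunks: int = 4) -> list[str]:
    """Split only at whitespace (a guaranteed token boundary for this
    tokenizer), tokenize the chunks concurrently, and concatenate."""
    words = text.split()
    size = max(1, len(words) // num_chunks)
    chunks = [" ".join(words[i:i + size]) for i in range(0, len(words), size)]
    with ThreadPoolExecutor() as pool:
        parts = list(pool.map(tokenize, chunks))
    return [tok for part in parts for tok in part]

if __name__ == "__main__":
    text = "the quick brown fox jumps over the lazy dog " * 100
    assert tokenize_parallel(text) == tokenize(text)  # identical to sequential
    print(len(tokenize(text)), "tokens")
```

Real subword tokenizers can merge across arbitrary positions, which is exactly why a general lossless guarantee such as LoPT's is harder to obtain than this whitespace-only case suggests.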
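The iterative layer-wise distillation paper builds on ShortGPT, which ranks transformer layers by how much they change the hidden representation, commonly measured as one minus the cosine similarity between a layer's input and output states, and removes the least influential ones. The sketch below computes such a score from per-layer hidden states; the exact metric, calibration data, and iteration schedule used in the paper are assumptions here.

```python
import numpy as np

def block_influence(h_in: np.ndarray, h_out: np.ndarray) -> float:
    """One minus the mean cosine similarity between the hidden states entering
    (h_in) and leaving (h_out) a layer; both have shape (num_tokens, d_model)."""
    cos = np.sum(h_in * h_out, axis=-1) / (
        np.linalg.norm(h_in, axis=-1) * np.linalg.norm(h_out, axis=-1) + 1e-8)
    return float(1.0 - cos.mean())

def rank_layers_for_pruning(layer_states: list[np.ndarray]) -> list[int]:
    """layer_states[i] is the hidden state after layer i (index 0 = embedding
    output). Returns layer indices ordered from least to most influential."""
    scores = [block_influence(layer_states[i], layer_states[i + 1])
              for i in range(len(layer_states) - 1)]
    return sorted(range(len(scores)), key=scores.__getitem__)

if __name__ == "__main__":
    # Random activations stand in for states collected from a real model
    # on a small calibration set.
    rng = np.random.default_rng(0)
    states = [rng.standard_normal((16, 64)) for _ in range(9)]  # 8 "layers"
    print(rank_layers_for_pruning(states))
```

In an iterative setup one would presumably re-score after each removal and distill to recover lost quality, but that loop is not shown here.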
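Information capacity relates a model's text-compression performance to its computational cost. One plausible way to instantiate this, shown below, is to convert per-token log-probabilities into bits per byte and compare against an approximate decoding cost of 2 x parameters FLOPs per token; the specific normalization used in the paper is not reproduced here, so the final ratio is illustrative only.

```python
import math

def compression_bits(token_logprobs: list[float]) -> float:
    """Total bits needed to encode the text under the model (negative
    log2-likelihood), given natural-log per-token log-probabilities."""
    return sum(-lp for lp in token_logprobs) / math.log(2)

def efficiency_report(token_logprobs: list[float], text_bytes: int,
                      params: float) -> dict:
    """Compare compression quality (bits per byte, lower is better) against
    an approximate decoding cost of ~2 * params FLOPs per token."""
    bits_per_byte = compression_bits(token_logprobs) / text_bytes
    flops_per_token = 2.0 * params
    return {
        "bits_per_byte": bits_per_byte,
        "flops_per_token": flops_per_token,
        # Illustrative ratio only: compression gain (vs. 8 bits/byte raw text)
        # per order of magnitude of compute. Not the paper's definition.
        "score": (8.0 / bits_per_byte) / math.log10(flops_per_token),
    }

if __name__ == "__main__":
    # Toy numbers: a larger model compresses better but spends more compute.
    small = efficiency_report([-2.0] * 1000, text_bytes=4000, params=1e9)
    large = efficiency_report([-1.5] * 1000, text_bytes=4000, params=7e9)
    print(round(small["score"], 3), round(large["score"], 3))
```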

Sources

Deep Progressive Training: scaling up depth capacity of zero/one-layer models

LoPT: Lossless Parallel Tokenization Acceleration for Long Context Inference of Large Language Model

Iterative Layer-wise Distillation for Efficient Compression of Large Language Models

Attention and Compression is all you need for Controllably Efficient Language Models

A Metamorphic Testing Perspective on Knowledge Distillation for Language Models of Code: Does the Student Deeply Mimic the Teacher?

Schedulers for Schedule-free: Theoretically inspired hyperparameters

Information Capacity: Evaluating the Efficiency of Large Language Models via Text Compression

Sentence-Anchored Gist Compression for Long-Context LLMs

Context-Aware Dynamic Chunking for Streaming Tibetan Speech Recognition
