Research on large language models (LLMs) is advancing rapidly toward more efficient training and inference, with the goal of reducing computational cost, memory usage, and energy consumption while maintaining or improving model quality. One prominent direction is dynamic workload reduction, exemplified by Mixture-of-Experts (MoE) architectures and sparse attention, which cut the work performed per token and can substantially lower compute costs. Because the reduced workload depends on the input, however, these methods often introduce imbalance across devices, motivating autonomous dynamic load-balancing solutions.

A second active area is efficient model editing, which enables precise knowledge updates in LLMs without full retraining. Recent work addresses the harder setting of sequential editing, where many edits accumulate over time, using tools such as queuing theory and Lyapunov optimization to guarantee long-term knowledge preservation. Researchers are also exploiting contextual sparsity, the observation that only an input-dependent subset of a model's neurons and attention heads contributes meaningfully to a given forward pass, to obtain significant inference speedups.

Noteworthy papers in this area include TokenWeave, which proposes a Token-Splitting technique to overlap computation with communication, and MegaScale-MoE, a production system for efficient training of large-scale MoE models. EfficientLLM contributes a comprehensive empirical study of efficiency techniques for LLMs, while ULTRAEDIT introduces an editing solution that is training-, subject-, and memory-free. Polar Sparsity demonstrates that contextual sparsity remains effective at large batch sizes, and LyapLock provides a sequential-editing framework with rigorous theoretical guarantees on bounded knowledge preservation.
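
To make the load-imbalance point concrete, the sketch below shows a minimal top-k MoE router in PyTorch: each token selects its highest-scoring experts, and the resulting per-expert token counts are typically far from uniform, which is exactly the imbalance that dynamic load-balancing schemes target. All names and dimensions here are illustrative; this is a generic sketch, not the routing code of any particular system.

```python
import torch

def topk_route(hidden, router_weight, k=2):
    """Route each token to its top-k experts via a learned linear router.

    hidden: (num_tokens, d_model) token activations
    router_weight: (d_model, num_experts) router projection
    Returns expert indices, routing weights, and per-expert token counts.
    """
    logits = hidden @ router_weight                     # (num_tokens, num_experts)
    probs = torch.softmax(logits, dim=-1)
    weights, expert_idx = probs.topk(k, dim=-1)         # top-k experts per token
    # Per-expert token counts expose the load imbalance the routing induces:
    counts = torch.bincount(expert_idx.flatten(), minlength=router_weight.shape[1])
    return expert_idx, weights, counts

# Example: 4096 tokens, 1024-dim hidden states, 8 experts
hidden = torch.randn(4096, 1024)
router = torch.randn(1024, 8)
_, _, counts = topk_route(hidden, router)
print("tokens per expert:", counts.tolist())            # typically far from uniform
```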
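
The queuing-theory and Lyapunov-optimization angle on sequential editing can be summarized by the standard drift-plus-penalty construction: the long-term preservation requirement becomes a virtual queue, and each edit minimizes its own editing loss plus a queue-weighted preservation term. The formulation below is a generic sketch of that construction with illustrative notation, not necessarily LyapLock's exact objective.

```latex
% Generic Lyapunov drift-plus-penalty sketch for sequential editing
% (illustrative notation; not necessarily LyapLock's exact formulation).
% Long-term preservation constraint on the edited parameters \theta_t:
\[ \limsup_{T\to\infty}\tfrac{1}{T}\textstyle\sum_{t=1}^{T}\ell_{\mathrm{pres}}(\theta_t) \le \epsilon \]
% Virtual queue tracking accumulated constraint violation:
\[ Q_{t+1} = \max\bigl(Q_t + \ell_{\mathrm{pres}}(\theta_t) - \epsilon,\; 0\bigr) \]
% Per-edit objective: trade new-edit loss against preservation, weighted by the queue:
\[ \theta_{t+1} = \arg\min_{\theta}\; V\,\ell_{\mathrm{edit}}(\theta; e_t) + Q_t\,\ell_{\mathrm{pres}}(\theta) \]
% Bounding the drift of the Lyapunov function L(Q_t) = Q_t^2/2 keeps Q_t, and hence
% the long-run preservation loss, bounded while each new edit e_t is applied.
```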
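
Contextual sparsity exploits the fact that, for a given input, only a small input-dependent subset of MLP neurons (and attention heads) matters. A minimal sketch of the MLP side is below: a cheap predictor scores the intermediate neurons, and only the selected rows and columns of the MLP weights are used. The predictor and dimensions are illustrative stand-ins, not the mechanism of Polar Sparsity or any specific system.

```python
import torch

def sparse_mlp_forward(x, w_up, w_down, predictor, keep_ratio=0.1):
    """Contextual-sparsity style MLP forward for a single token.

    x: (d_model,) token activation
    w_up: (d_ff, d_model), w_down: (d_model, d_ff) MLP weights
    predictor: a cheap module scoring the d_ff intermediate neurons for this input
    """
    scores = predictor(x)                               # (d_ff,) relevance per neuron
    k = max(1, int(keep_ratio * scores.numel()))
    active = scores.topk(k).indices                     # neurons predicted to fire
    h = torch.relu(w_up[active] @ x)                    # compute only the selected rows
    return w_down[:, active] @ h                        # and the matching columns

d_model, d_ff = 1024, 4096
x = torch.randn(d_model)
w_up, w_down = torch.randn(d_ff, d_model), torch.randn(d_model, d_ff)
predictor = torch.nn.Linear(d_model, d_ff)              # stand-in for a small learned predictor
out = sparse_mlp_forward(x, w_up, w_down, predictor)
```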
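
Finally, the computation-communication overlap that TokenWeave targets can be illustrated in generic form with chunked execution and asynchronous collectives: while one chunk's all-reduce is in flight, the next chunk is computed. This is a simplified sketch of the general pattern, assuming an already-initialized torch.distributed process group and a tensor-parallel layer whose outputs require an all-reduce; it is not TokenWeave's actual implementation.

```python
import torch
import torch.distributed as dist

def overlapped_forward(x, layer, num_chunks=2):
    """Overlap per-chunk computation with the previous chunk's all-reduce.

    Assumes dist.init_process_group(...) has already been called (e.g. via torchrun)
    and that `layer` returns partial sums that must be all-reduced across ranks,
    as in tensor-parallel transformer layers. Illustrative pattern only.
    """
    outputs, handles = [], []
    for chunk in x.chunk(num_chunks, dim=0):            # split along the token dimension
        y = layer(chunk)                                 # compute this chunk's partial output
        handles.append(dist.all_reduce(y, async_op=True))  # launch its communication...
        outputs.append(y)                                # ...and move on to the next chunk
    for handle in handles:
        handle.wait()                                    # drain in-flight communication
    return torch.cat(outputs, dim=0)
```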