Accelerating Large Language Models

Research on large language models (LLMs) is increasingly focused on improving inference efficiency and reducing computational cost. Recent work accelerates LLM inference through speculative decoding, pipeline parallelism, and model-level optimization. Notable directions include decoding methods that cut the costly data transfers incurred when model weights are offloaded during forward passes, and techniques that give state-space models such as Mamba robust long-context generalization.
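
To make the shared mechanism concrete, below is a minimal sketch of greedy speculative decoding, the pattern several of the papers listed here build on: a cheap draft model proposes a few tokens, and the target model verifies them in a single batched pass, accepting the longest agreeing prefix. The toy bigram-style models (`draft_step`, `target_logits`) and their weights are hypothetical stand-ins for illustration, not any paper's actual implementation.

```python
# Minimal sketch of greedy speculative decoding (draft-then-verify).
# draft_step / target_logits are toy stand-ins for a small draft model
# and the full target model; real systems call LLM forward passes here.
import numpy as np

VOCAB = 50
rng = np.random.default_rng(0)
W_DRAFT = rng.normal(size=(VOCAB, VOCAB))                      # toy "small" model
W_TARGET = W_DRAFT + 0.05 * rng.normal(size=(VOCAB, VOCAB))    # toy "large" model

def draft_step(token: int) -> int:
    """Cheap draft model: greedy next token from a toy bigram table."""
    return int(np.argmax(W_DRAFT[token]))

def target_logits(prev_tokens: list[int]) -> np.ndarray:
    """Target model scores all positions in one 'forward pass' (shape [T, V])."""
    return W_TARGET[np.array(prev_tokens)]

def speculative_decode(prompt: list[int], max_new: int = 20, k: int = 4) -> list[int]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # 1) Draft k tokens autoregressively with the cheap model.
        draft, last = [], out[-1]
        for _ in range(k):
            last = draft_step(last)
            draft.append(last)
        # 2) Verify all k drafts with a single batched target pass.
        prev = [out[-1]] + draft[:-1]          # token preceding each draft position
        target_pred = target_logits(prev).argmax(axis=-1)
        # 3) Accept the longest agreeing prefix, then one corrected token,
        #    so every iteration makes progress of at least one token.
        n_accept = 0
        while n_accept < k and draft[n_accept] == target_pred[n_accept]:
            n_accept += 1
        accepted = draft[:n_accept]
        if n_accept < k:
            accepted.append(int(target_pred[n_accept]))
        out.extend(accepted)
    return out[: len(prompt) + max_new]

print(speculative_decode([1, 2, 3]))
```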

Noteworthy papers include Mamba-2 audio captioning, which systematically explores the design space of Mamba-2-based audio captioning models; SpecMamba, an FPGA-based accelerator for Mamba with speculative decoding that achieves a 2.27x speedup over GPU baselines; FastMTP, a simple yet effective method for improving multi-step draft quality that yields an average 2.03x speedup over standard next-token prediction; and Pipeline Parallelism is All You Need for Optimized Early-Exit Based Self-Speculative Decoding, which achieves state-of-the-art acceleration for self-speculative LLM inference by drafting from early layers and parallelizing the pipeline.
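
As an illustration of the early-exit self-speculative variant, the sketch below drafts with only the first few layers of the same toy model and verifies with the full depth, so no separate draft network is needed. The model, layer counts, and helper functions (`forward`, `early_exit_draft`, `verify`) are invented for this example; real systems batch the verification pass, and how drafting and verification are scheduled across pipeline stages is the focus of the pipeline-parallelism paper above.

```python
# Sketch of early-exit self-speculative drafting: the draft model is the
# first EXIT_LAYER layers of the target model plus the shared output head.
# All weights are toy matrices, not a trained network.
import numpy as np

VOCAB, DIM, N_LAYERS, EXIT_LAYER = 40, 16, 8, 2
rng = np.random.default_rng(1)
EMBED = rng.normal(size=(VOCAB, DIM))
LAYERS = [rng.normal(size=(DIM, DIM)) / np.sqrt(DIM) for _ in range(N_LAYERS)]
HEAD = rng.normal(size=(DIM, VOCAB))

def forward(token: int, n_layers: int) -> np.ndarray:
    """Run the first n_layers of the toy model and return next-token logits."""
    h = EMBED[token]
    for w in LAYERS[:n_layers]:
        h = np.tanh(h @ w)
    return h @ HEAD

def early_exit_draft(last_token: int, k: int) -> list[int]:
    """Draft k tokens cheaply using only the first EXIT_LAYER layers."""
    draft, tok = [], last_token
    for _ in range(k):
        tok = int(np.argmax(forward(tok, EXIT_LAYER)))
        draft.append(tok)
    return draft

def verify(last_token: int, draft: list[int]) -> list[int]:
    """Check drafts with the full-depth model; keep the agreeing prefix."""
    prev = [last_token] + draft[:-1]
    accepted = []
    for p, d in zip(prev, draft):
        full = int(np.argmax(forward(p, N_LAYERS)))
        if full != d:
            accepted.append(full)   # replace the first mismatch, then stop
            break
        accepted.append(d)
    return accepted

tokens = [3]
while len(tokens) < 12:
    tokens.extend(verify(tokens[-1], early_exit_draft(tokens[-1], k=4)))
print(tokens)
```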

Sources

Mamba-2 audio captioning: design space exploration and analysis

Speculate Deep and Accurate: Lossless and Training-Free Acceleration for Offloaded LLMs via Substitute Speculative Decoding

FastMTP: Accelerating LLM Inference with Enhanced Multi-Token Prediction

Pipeline Parallelism is All You Need for Optimized Early-Exit Based Self-Speculative Decoding

Mamba Modulation: On the Length Generalization of Mamba

SpecMamba: Accelerating Mamba Inference on FPGA with Speculative Decoding
