Accelerating Large Language Model Inference

The field of large language model (LLM) inference is advancing rapidly, with a focus on improving efficiency and reducing latency. Recent work has centered on speculative decoding, which uses a smaller draft model to propose tokens that a larger target model then verifies. This approach has shown significant promise for accelerating inference, and a range of techniques has been proposed to optimize it, such as adaptive rescheduling, dynamic speculative sampling, and fairness-aware batch formation. Notably, intermediate qualifier models, context-dependent dynamic shortlisting, and parallel heterogeneous execution have been explored to further enhance speculative decoding efficiency. These innovations have the potential to significantly improve the performance of LLMs, enabling faster and more accurate inference.

Noteworthy papers include Conformal Sparsification for Bandwidth-Efficient Edge-Cloud Speculative Decoding, which proposes a sparse quantize-and-sample framework that reduces end-to-end latency and rejection rates, and Mirror Speculative Decoding: Breaking the Serial Barrier in LLM Inference, which launches branch-complete rollouts from early-exit signals in parallel with the target model's suffix to exploit cross-device parallelism.
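To make the draft-and-verify idea concrete, the sketch below is a minimal, generic illustration of a single speculative decoding step; it is not taken from any of the listed papers. The functions draft_probs and target_probs are hypothetical stand-ins for the next-token distributions of a small draft model and a large target model, and the acceptance rule follows the standard rejection-sampling criterion used in speculative decoding.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE = 16  # toy vocabulary for the sketch


def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()


def draft_probs(context):
    # Hypothetical cheap draft model: next-token distribution over the vocab.
    return softmax(rng.normal(size=VOCAB_SIZE))


def target_probs(context):
    # Hypothetical expensive target model; in practice all draft positions
    # are scored in a single batched forward pass rather than one by one.
    return softmax(rng.normal(size=VOCAB_SIZE))


def speculative_step(context, k=4):
    """Propose k draft tokens, then accept/reject them with the standard
    rejection-sampling rule so accepted tokens follow the target distribution."""
    # 1. Draft phase: the small model proposes k tokens autoregressively.
    drafts, q_dists = [], []
    ctx = list(context)
    for _ in range(k):
        q = draft_probs(ctx)
        tok = rng.choice(VOCAB_SIZE, p=q)
        drafts.append(tok)
        q_dists.append(q)
        ctx.append(tok)

    # 2. Verify phase: the target model checks each proposed position in order.
    accepted = []
    ctx = list(context)
    for tok, q in zip(drafts, q_dists):
        p = target_probs(ctx)
        # Accept the draft token with probability min(1, p[tok] / q[tok]).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)
            ctx.append(tok)
        else:
            # On rejection, resample from the renormalized residual
            # max(p - q, 0) and stop accepting further draft tokens.
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            accepted.append(rng.choice(VOCAB_SIZE, p=residual))
            break
    else:
        # All k drafts accepted: sample one bonus token from the target.
        accepted.append(rng.choice(VOCAB_SIZE, p=target_probs(ctx)))
    return accepted


print(speculative_step(context=[1, 2, 3], k=4))
```

Because rejected positions are resampled from the residual distribution, the emitted tokens are distributed as if the target model had decoded alone; the speedup comes from verifying several draft tokens per target-model pass instead of generating one token at a time.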

Sources

Conformal Sparsification for Bandwidth-Efficient Edge-Cloud Speculative Decoding

SP-MoE: Speculative Decoding and Prefetching for Accelerating MoE-based Model Inference

Efficient LLM Inference over Heterogeneous Edge Networks with Speculative Decoding

Direct Multi-Token Decoding

Traveling Salesman-Based Token Ordering Improves Stability in Homomorphically Encrypted Language Models

MoBiLE: Efficient Mixture-of-Experts Inference on Consumer GPU with Mixture of Big Little Experts

3-Model Speculative Decoding

Mirror Speculative Decoding: Breaking the Serial Barrier in LLM Inference

Adaptive Rescheduling in Prefill-Decode Disaggregated LLM Inference

DynaSpec: Context-aware Dynamic Speculative Sampling for Large-Vocabulary Language Models

FairBatching: Fairness-Aware Batch Formation for LLM Inference
