Accelerating Large Language Models with Speculative Decoding and Efficient Inference

The field of large language models (LLMs) is rapidly advancing, with a strong focus on improving inference efficiency. Recent developments have centered on speculative decoding, which speeds up token generation by having a cheap draft model propose candidate tokens that the target model then verifies. This approach has yielded significant speedups, with some methods reporting up to 5.5x acceleration over standard autoregressive decoding. Notably, techniques such as cascade adaptive self-speculative decoding, proxy-based test-time alignment, and diffusion-based drafting have shown promise in reducing latency while preserving output quality. Research has also explored retrieval-enhanced drafting, adaptive thresholding, and expert routing to further optimize speculative decoding. Together, these advances point toward scalable, low-latency LLM services, particularly in resource-constrained settings. Noteworthy papers in this area include CAS-Spec, which proposes a cascade adaptive self-speculative decoding method, and SpecDiff-2, which leverages discrete diffusion to address bottlenecks in speculative decoding. Additionally, ReSpec introduces a framework for retrieval-enhanced speculative decoding, while TapOut presents a bandit-based approach to dynamic speculative decoding.
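For readers unfamiliar with the draft-then-verify pattern these papers build on, the following is a minimal, illustrative sketch of generic speculative sampling. It is not the implementation of any paper listed below: `target_prob` and `draft_prob` are hypothetical placeholders standing in for forward passes of a large target model and a small draft model, and the acceptance rule shown is the standard min(1, p/q) test with residual resampling on rejection.

```python
import random

def sample(dist):
    """Sample a token from a dict {token: probability}."""
    toks, probs = zip(*dist.items())
    return random.choices(toks, weights=probs)[0]

def speculative_decode(target_prob, draft_prob, prefix, k=4, max_new=32):
    """Toy draft-then-verify loop (generic speculative sampling).

    target_prob(ctx) -> {token: prob} under the large target model
    draft_prob(ctx)  -> {token: prob} under the small draft model
    Both callables are placeholders; real systems wrap LLM forward passes.
    """
    tokens = list(prefix)
    while len(tokens) - len(prefix) < max_new:
        # 1) Draft: cheaply propose k candidate tokens with the small model.
        drafted = []
        for _ in range(k):
            drafted.append(sample(draft_prob(tokens + drafted)))

        # 2) Verify: accept each drafted token with probability min(1, p/q).
        n_accepted = 0
        for i, t in enumerate(drafted):
            ctx = tokens + drafted[:i]
            p = target_prob(ctx).get(t, 0.0)
            q = max(draft_prob(ctx).get(t, 0.0), 1e-12)
            if random.random() < min(1.0, p / q):
                n_accepted += 1
            else:
                break
        tokens += drafted[:n_accepted]

        # 3) Correct: on rejection, resample from the residual max(0, p - q);
        #    if every drafted token was accepted, take one extra target token.
        p = target_prob(tokens)
        if n_accepted < len(drafted):
            q = draft_prob(tokens)
            residual = {t: max(0.0, p[t] - q.get(t, 0.0)) for t in p}
            tokens.append(sample(residual) if sum(residual.values()) > 0 else sample(p))
        else:
            tokens.append(sample(p))
    return tokens
```

The speedup comes from step 2: one target-model verification pass can accept several cheap draft tokens at once, while the acceptance-and-residual rule keeps the output distribution matching the target model. The papers below vary the drafting strategy (cascades, retrieval, diffusion), the acceptance policy (pivot-aware rejection, dynamic thresholds, bandits), and the system setting (parallel, split, or resource-constrained inference).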

Sources

CAS-Spec: Cascade Adaptive Self-Speculative Decoding for On-the-Fly Lossless Inference Acceleration of LLMs

Kad: A Framework for Proxy-based Test-time Alignment with Knapsack Approximation Deferral

Reject Only Critical Tokens: Pivot-Aware Speculative Decoding

SpecDiff-2: Scaling Diffusion Drafter Alignment For Faster Speculative Decoding

When, What, and How: Rethinking Retrieval-Enhanced Speculative Decoding

Collaborative Large Language Model Inference via Resource-Aware Parallel Speculative Decoding

TapOut: A Bandit-Based Approach to Dynamic Speculative Decoding

Beyond Static Cutoffs: One-Shot Dynamic Thresholding for Diffusion Language Models

Opportunistic Expert Activation: Batch-Aware Expert Routing for Faster Decode Without Retraining

STARS: Segment-level Token Alignment with Rejection Sampling in Large Language Models

Memory- and Latency-Constrained Inference of Large Language Models via Adaptive Split Computing

A Characterization of List Language Identification in the Limit

Optimal Inference Schedules for Masked Diffusion Models
