The field of large language models is moving toward more efficient inference through speculative decoding, in which a small draft model proposes several tokens that the target model then verifies in parallel; because verification is far cheaper than token-by-token generation, this yields substantial speedups. Recent work extends the idea to batches, addressing the ragged tensors and synchronization requirements that arise when sequences in a batch accept different numbers of draft tokens. Researchers have also introduced frameworks that characterize the optimal inference time of multi-model speculative decoding systems and optimize the interplay between model capability, acceptance length, and computational cost. Together these advances deliver large throughput gains, with some approaches reaching speedups of up to 5.2x over conventional autoregressive decoding. Noteworthy papers include Batch Speculative Decoding Done Right, which presents a correctness-first approach to batch speculative decoding and achieves up to 3x higher throughput; ReSpec, which adapts speculative decoding to reinforcement learning systems, achieving up to 4.5x speedup while preserving reward convergence and training stability; and Polybasic Speculative Decoding Through a Theoretical Perspective, which introduces a polybasic speculative decoding framework yielding speedups of 3.31x to 4.43x across a range of models.
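
To make the draft-and-verify loop concrete, here is a minimal sketch of the greedy-verification variant of speculative decoding in plain Python. The names (`speculative_decode`, `draft_next`, `target_next`, `k`) are illustrative assumptions rather than APIs from any of the papers above, and the toy callables stand in for real models; a production system would also obtain all k+1 target predictions from a single batched forward pass instead of the per-position calls shown here.

```python
from typing import Callable, List

def speculative_decode(
    prompt: List[int],
    draft_next: Callable[[List[int]], int],   # cheap model: sequence -> argmax next token
    target_next: Callable[[List[int]], int],  # expensive model: sequence -> argmax next token
    k: int = 4,                                # tokens the draft proposes per round
    max_new_tokens: int = 32,
) -> List[int]:
    """Greedy speculative decoding sketch (illustrative, not any paper's code)."""
    seq = list(prompt)
    produced = 0
    while produced < max_new_tokens:
        # 1) The draft model proposes k tokens autoregressively (cheap).
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft_next(seq + proposal))
        # 2) The target verifies each proposed position; in a real system
        #    these k+1 predictions come from ONE parallel forward pass.
        n_accept = 0
        for i in range(k):
            if target_next(seq + proposal[:i]) == proposal[i]:
                n_accept += 1
            else:
                break
        seq += proposal[:n_accept]
        produced += n_accept
        # 3) The target's own prediction is appended either as the correction
        #    at the first mismatch or as a free bonus token after k accepts.
        seq.append(target_next(seq))
        produced += 1
    return seq

if __name__ == "__main__":
    # Toy demo: the draft counts up by 1; the target agrees except when the
    # context length is a multiple of 5, forcing a rejection + correction.
    draft = lambda s: (s[-1] + 1) % 100
    target = lambda s: (s[-1] + (1 if len(s) % 5 else 2)) % 100
    print(speculative_decode([0], draft, target, k=3, max_new_tokens=10))
```

The property this sketch preserves is the one that makes speculative decoding attractive: the committed sequence is exactly what the target model alone would have produced under greedy decoding; the draft model only reduces how many sequential target steps are needed.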