The field of large language model inference is moving toward faster decoding while maintaining competitive accuracy. Recent developments center on speculative decoding, in which a lightweight draft model proposes several tokens ahead and the larger target model verifies them in parallel, putting otherwise idle computational resources to work to accelerate generation without degrading output quality. These techniques are being adapted to a range of settings, including time-series forecasting, generative recommendation, and large-batch serving. Researchers are also exploring distributed speculative decoding for edge-cloud environments, enabling agile and scalable large language model serving; a minimal sketch of the underlying draft-and-verify loop follows the list of papers below. Notable papers in this area include:
- Accelerating Time Series Foundation Models with Speculative Decoding, which proposes a general inference acceleration framework for autoregressive time-series models.
- NEZHA: A Zero-sacrifice and Hyperspeed Decoding Architecture for Generative Recommendations, which achieves hyperspeed decoding without sacrificing recommendation quality.
- Scaling LLM Speculative Decoding: Non-Autoregressive Forecasting in Large-Batch Scenarios, which proposes a novel architecture that integrates unidirectional and bidirectional attention mechanisms.
- DSD: A Distributed Speculative Decoding Solution for Edge-Cloud Agile Large Model Serving, which extends speculative decoding to multi-device deployments through coordinated draft-target execution.
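Since these papers all build on the same draft-and-verify core, the toy sketch below illustrates why speculative decoding can accelerate generation without changing the target model's output. It is a minimal sketch, not any paper's implementation: it assumes a small shared vocabulary, stand-in NumPy "models" in place of real draft and target LLMs, and the simple greedy acceptance variant.

```python
import numpy as np

# Toy "models": each maps the last token of a prefix to a distribution
# over a small vocabulary. In practice these would be a small draft LLM
# and a large target LLM; here they are hypothetical stand-ins.
VOCAB = 8
rng = np.random.default_rng(0)
TARGET_W = rng.normal(size=(VOCAB, VOCAB))
DRAFT_W = TARGET_W + 0.1 * rng.normal(size=(VOCAB, VOCAB))  # draft roughly tracks target

def next_token_probs(weights, prefix):
    """Next-token distribution conditioned on the last token of the prefix."""
    logits = weights[prefix[-1]]
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def speculative_step(prefix, k=4):
    """One draft-and-verify step (greedy variant).

    The draft model proposes k tokens autoregressively; the target model
    then checks them, keeping the longest agreeing prefix plus one token
    of its own at the first disagreement.
    """
    # 1. Draft proposes k tokens cheaply.
    draft_seq = list(prefix)
    proposals = []
    for _ in range(k):
        tok = int(np.argmax(next_token_probs(DRAFT_W, draft_seq)))
        proposals.append(tok)
        draft_seq.append(tok)

    # 2. Target verifies the proposals (in a real system this is a single
    #    batched forward pass over all k positions, not a Python loop).
    accepted = list(prefix)
    for tok in proposals:
        target_tok = int(np.argmax(next_token_probs(TARGET_W, accepted)))
        if target_tok == tok:
            accepted.append(tok)          # proposal matches target: keep it
        else:
            accepted.append(target_tok)   # mismatch: take target's token, stop
            break
    else:
        # All k proposals accepted; target contributes one bonus token.
        accepted.append(int(np.argmax(next_token_probs(TARGET_W, accepted))))
    return accepted

seq = [0]
for _ in range(5):
    seq = speculative_step(seq)
print(seq)
```

Because every accepted token equals the target model's own greedy choice at that position, the generated sequence is identical to decoding with the target model alone; the speedup comes from verifying the k proposed tokens in one batched target pass instead of k sequential ones.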