The field of large language model inference is advancing rapidly, with a focus on improving efficiency and reducing latency. Recent work has centered on optimizing pipeline parallelism, speculative decoding, and resource-aware partitioning, yielding notable speedups, particularly in resource-constrained edge environments. Innovations in algorithm-hardware co-design and vector-storage approaches have also shown promise. Overall, the field is moving toward more scalable and efficient solutions for deploying large language models. Noteworthy papers include:
- Nesterov Method for Asynchronous Pipeline Parallel Optimization, which introduced a variant of Nesterov Accelerated Gradient for asynchronous optimization in pipeline parallelism (a sketch of the standard Nesterov update appears after this list).
- PipeSpec, which presented a framework for speculative decoding with hierarchical pipeline execution (a baseline speculative-decoding sketch also follows the list).
- RetroInfer, which proposed a vector-storage approach for scalable long-context LLM inference (a generic retrieval-attention sketch follows as well).
- AccLLM, which developed a comprehensive acceleration framework through algorithm-hardware co-design.
- Prism, which exploited GPU sharing for cost-efficient multi-LLM serving.
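
For background on the first paper: the standard Nesterov Accelerated Gradient update it builds on is sketched below. How the paper adapts this update to stale gradients in asynchronous pipeline parallelism is not reproduced here; the learning rate, momentum value, and toy quadratic objective are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def nesterov_step(params, velocity, grad_fn, lr=0.01, momentum=0.9):
    """One standard Nesterov Accelerated Gradient step.

    The gradient is evaluated at the 'look-ahead' point
    params + momentum * velocity rather than at params itself.
    """
    lookahead = params + momentum * velocity
    grad = grad_fn(lookahead)                      # gradient at the look-ahead point
    velocity = momentum * velocity - lr * grad
    return params + velocity, velocity

# Toy usage: minimize f(x) = ||x||^2, whose gradient is 2x (illustrative only).
params = np.array([5.0, -3.0])
velocity = np.zeros_like(params)
for _ in range(200):
    params, velocity = nesterov_step(params, velocity, grad_fn=lambda x: 2 * x)
print(params)  # approaches the minimizer at the origin
```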
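
PipeSpec's hierarchical pipeline execution is not detailed in this digest; the sketch below shows only the generic draft-and-verify loop that speculative decoding methods share, using a simplified greedy acceptance rule rather than the usual probabilistic one. The `target_next` and `draft_next` callables and the toy integer-token "models" are assumptions for illustration.

```python
from typing import Callable, List

def speculative_decode(
    target_next: Callable[[List[int]], int],   # target model's greedy next token (assumed interface)
    draft_next: Callable[[List[int]], int],    # cheap draft model's greedy next token (assumed interface)
    prompt: List[int],
    max_new_tokens: int = 32,
    draft_len: int = 4,
) -> List[int]:
    """Simplified greedy speculative decoding: the draft model proposes a short
    continuation, the target model keeps the longest matching prefix and emits
    its own token at the first mismatch."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # Draft phase: propose draft_len tokens cheaply.
        draft, ctx = [], list(tokens)
        for _ in range(draft_len):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # Verify phase: check drafted tokens in order against the target model.
        for t in draft:
            expected = target_next(tokens)
            if expected == t:
                tokens.append(t)           # accepted draft token
            else:
                tokens.append(expected)    # first mismatch: take the target's token
                break
    return tokens[len(prompt):len(prompt) + max_new_tokens]

# Toy usage with trivial integer-token "models" (illustrative only).
target = lambda ctx: (ctx[-1] + 1) % 100
draft = lambda ctx: (ctx[-1] + 1) % 100 if len(ctx) % 3 else (ctx[-1] + 2) % 100
print(speculative_decode(target, draft, prompt=[0], max_new_tokens=10))
```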
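
Finally, a minimal sketch of the general idea behind retrieval-style attention over a key-value cache: attend only over the top-k cached keys most similar to the current query instead of the full long context. This is a generic approximation, not RetroInfer's actual vector-storage design (its indexing and memory placement are not represented); the array shapes and `k=8` are illustrative choices.

```python
import numpy as np

def topk_retrieval_attention(q, keys, values, k=8):
    """Attention restricted to the k cached key/value pairs whose keys score
    highest against the current query."""
    scores = keys @ q / np.sqrt(q.shape[-1])     # similarity of the query to every cached key
    top = np.argpartition(scores, -k)[-k:]       # indices of the k highest-scoring keys
    w = np.exp(scores[top] - scores[top].max())  # softmax over the retrieved subset only
    w /= w.sum()
    return w @ values[top]                       # weighted sum of the retrieved values

# Toy usage: a cache of 10,000 entries, attend over the top 8 (illustrative only).
rng = np.random.default_rng(0)
d = 64
keys = rng.normal(size=(10_000, d))
values = rng.normal(size=(10_000, d))
q = rng.normal(size=(d,))
print(topk_retrieval_attention(q, keys, values).shape)  # (64,)
```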