Advances in Efficient Large Language Model Inference

Research on large language model inference is advancing rapidly, with a focus on improving efficiency and reducing latency. Recent work centers on optimizing pipeline parallelism, speculative decoding, and resource-aware partitioning, yielding notable speedups, particularly in resource-constrained edge environments. Algorithm-hardware co-design and vector-storage approaches are also proving effective for long-context and accelerator-bound workloads. Overall, the field is moving toward more scalable and cost-efficient ways to serve large language models. Noteworthy papers include:

  • Nesterov Method for Asynchronous Pipeline Parallel Optimization, which introduces a variant of Nesterov Accelerated Gradient for asynchronous optimization in pipeline parallelism (see the first sketch after this list).
  • PipeSpec, which presents a speculative-decoding framework with hierarchical pipeline execution that breaks dependencies between draft and verify stages (see the second sketch after this list).
  • RetroInfer, which proposes a vector-storage approach for scalable long-context LLM inference (see the third sketch after this list).
  • AccLLM, which develops a comprehensive acceleration framework for long-context inference through algorithm-hardware co-design.
  • Prism, which enables cost-efficient multi-LLM serving by sharing GPUs across models.
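
For context on the first item: in asynchronous pipeline parallelism, stages update their parameters while gradients computed on slightly stale weights are still in flight, and the paper adapts Nesterov Accelerated Gradient to that setting. The sketch below shows only the standard NAG update the method builds on, not the paper's variant; the function names and the toy objective are illustrative assumptions.

```python
import numpy as np

def nesterov_step(theta, velocity, grad_fn, lr=0.01, momentum=0.9):
    """One standard Nesterov Accelerated Gradient step (not the paper's async variant)."""
    lookahead = theta + momentum * velocity      # evaluate the gradient at a look-ahead point
    grad = grad_fn(lookahead)
    velocity = momentum * velocity - lr * grad   # momentum update
    return theta + velocity, velocity            # apply the step

# Toy usage: minimize f(x) = ||x||^2, whose gradient is 2x.
theta, velocity = np.ones(4), np.zeros(4)
for _ in range(200):
    theta, velocity = nesterov_step(theta, velocity, grad_fn=lambda x: 2.0 * x)
print(theta)  # converges toward the zero vector
```

The look-ahead evaluation point is what distinguishes NAG from plain momentum, and it is also why the method is a natural fit for pipelines where gradients inevitably arrive against slightly outdated parameters.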
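The second item builds on speculative decoding, where a small draft model proposes several tokens and the large target model verifies them in a single forward pass. PipeSpec's contribution is running the draft and verify stages as a hierarchical pipeline rather than in lockstep; the sketch below shows only the baseline lockstep draft-and-verify loop, with `draft_model` and `target_model` as hypothetical callables that return greedy (argmax) next-token predictions for every position of the input.

```python
def speculative_decode(target_model, draft_model, prompt_ids, k=4, max_new=64):
    """Baseline greedy speculative decoding: draft k tokens cheaply, then
    verify them with one pass of the large target model and keep the longest
    prefix the target agrees with."""
    ids = list(prompt_ids)
    while len(ids) < len(prompt_ids) + max_new:
        # 1. Draft k tokens autoregressively with the cheap model.
        draft, ctx = [], list(ids)
        for _ in range(k):
            nxt = draft_model(ctx)[-1]          # draft model's argmax next token
            draft.append(nxt)
            ctx.append(nxt)
        # 2. Verify all k drafted tokens with a single target-model pass.
        preds = target_model(ids + draft)       # target's argmax at every position
        accepted = []
        for i, tok in enumerate(draft):
            if preds[len(ids) - 1 + i] == tok:  # target agrees: accept the drafted token
                accepted.append(tok)
            else:                               # disagreement: take the target's token, stop
                accepted.append(preds[len(ids) - 1 + i])
                break
        else:                                   # all drafts accepted: add one bonus token
            accepted.append(preds[len(ids) + k - 1])
        ids.extend(accepted)
    return ids
```

The speedup comes from checking k drafted tokens in one target-model pass instead of k sequential passes; PipeSpec goes further by letting successive draft and verify passes overlap across pipeline stages instead of alternating.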
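The third item treats long-context inference as a retrieval problem: instead of attending to every cached key-value pair, the KV cache is stored like a vector index and only the entries most relevant to the current query are fetched. The sketch below is a brute-force, single-head stand-in for that idea; a real system such as RetroInfer would replace the exact scan with an approximate vector index, and the function and shapes here are illustrative assumptions.

```python
import numpy as np

def topk_attention(query, keys, values, k=32):
    """Dense stand-in for retrieval-style attention: score every cached key,
    keep only the top-k, and run softmax attention over that subset.
    Shapes: query (d,), keys/values (n, d)."""
    scores = keys @ query / np.sqrt(query.shape[-1])   # scaled dot-product scores
    top = np.argpartition(scores, -k)[-k:]             # indices of the k highest scores
    weights = np.exp(scores[top] - scores[top].max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ values[top]                        # weighted sum of retrieved values

# Toy usage with a 10,000-entry cache and 64-dimensional heads.
rng = np.random.default_rng(0)
q = rng.standard_normal(64)
K = rng.standard_normal((10_000, 64))
V = rng.standard_normal((10_000, 64))
print(topk_attention(q, K, V, k=32).shape)  # (64,)
```

Restricting attention to a retrieved subset is what keeps per-token memory traffic roughly flat as the context grows; the quality of the result then depends on how well the index surfaces the keys that would have received high attention weight anyway.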

Sources

Nesterov Method for Asynchronous Pipeline Parallel Optimization

PipeSpec: Breaking Stage Dependencies in Hierarchical LLM Decoding

Large Language Model Partitioning for Low-Latency Inference at the Edge

RetroInfer: A Vector-Storage Approach for Scalable Long-Context LLM Inference

AccLLM: Accelerating Long-Context LLM Inference Via Algorithm-Hardware Co-Design

Improving the Serving Performance of Multi-LoRA Large Language Models via Efficient LoRA and KV Cache Management

Splitwiser: Efficient LM Inference with Constrained Resources

Prism: Unleashing GPU Sharing for Cost-Efficient Multi-LLM Serving

Pipelining Split Learning in Multi-hop Edge Networks
