Optimizing Large Language Models with Efficient Inference and Training Techniques

Research on large language models (LLMs) is increasingly focused on optimizing inference and training to improve efficiency and scalability. Researchers are exploring novel algorithms and architectures to reduce latency, cost, and GPU memory fragmentation. Notable directions include dynamic batching, sparse modeling, and collaborative inference between edge and cloud devices. These innovations enable more efficient and scalable LLM deployments, which is crucial for real-time applications and edge computing.
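To make one of these directions concrete, below is a minimal Python sketch of dynamic batching for inference serving. All names here (Request, run_batch, the batch-size and wait-time parameters) are illustrative assumptions rather than the API of any system covered in this digest.

```python
# Minimal sketch of dynamic batching for LLM inference serving.
# Everything here (Request, run_batch, max_batch, max_wait_s) is an
# illustrative assumption, not the API of any system named in this digest.
import queue
import time
from dataclasses import dataclass


@dataclass
class Request:
    prompt: str
    max_new_tokens: int


def run_batch(batch):
    # Placeholder for a single batched forward pass through the model.
    return [f"<output for {r.prompt!r}>" for r in batch]


def serve(request_queue, max_batch=8, max_wait_s=0.01):
    """Collect requests until the batch is full or the wait deadline passes,
    then run them together so one model call amortizes per-request overhead."""
    while True:
        batch = [request_queue.get()]  # block until at least one request arrives
        deadline = time.monotonic() + max_wait_s
        while len(batch) < max_batch:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(request_queue.get(timeout=remaining))
            except queue.Empty:
                break
        for req, out in zip(batch, run_batch(batch)):
            print(req.prompt, "->", out)
```

The trade-off captured by max_batch and max_wait_s is the usual one: larger batches raise GPU utilization and throughput, while a shorter wait bound keeps per-request latency in check.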

Several noteworthy papers illustrate this trend. Leveraging Multi-Instance GPUs through moldable task scheduling proposes a three-phase algorithm that optimizes task execution on multi-instance GPUs. WGRAMMAR is a lightweight decoding engine that leverages prior knowledge of output structure to achieve up to a 250x speedup over existing systems. BucketServe is a bucket-based dynamic batching framework that improves LLM inference performance, delivering up to a 3.58x gain in throughput. BrownoutServe is a serving framework for MoE-based LLMs that adapts to bursty workloads and optimizes inference efficiency, achieving up to a 2.07x throughput improvement. PolyServe is a multi-SLO scheduling policy that maintains high SLO attainment while maximizing throughput, achieving a 1.23x goodput gain over existing policies. Sandwich is a hardware-centric CPU-based LLM serving engine that optimizes the prefill and decode phases separately, achieving an average 2.01x throughput improvement with satisfactory time-to-first-token and time-per-output-token latencies for 90% of requests.
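As a rough illustration of the bucketing idea, the following Python sketch groups requests by prompt length before forming batches. It is only loosely inspired by the description of BucketServe above; the bucket boundaries and the greedy grouping are assumptions made for illustration, not the paper's actual algorithm.

```python
# Illustrative sketch of length-bucketed batch formation.
# Bucket edges and the greedy grouping are assumptions, not BucketServe's algorithm.
from collections import defaultdict

# Hypothetical bucket boundaries on prompt length (in tokens).
BUCKET_EDGES = [128, 512, 2048]


def bucket_of(prompt_len):
    """Return the index of the smallest bucket whose edge covers prompt_len."""
    for i, edge in enumerate(BUCKET_EDGES):
        if prompt_len <= edge:
            return i
    return len(BUCKET_EDGES)  # overflow bucket for very long prompts


def form_batches(requests, max_batch=8):
    """Group requests of similar length so each batch pads to a similar size,
    cutting wasted compute compared to mixing short and long prompts."""
    buckets = defaultdict(list)
    for req_id, prompt_len in requests:  # requests: iterable of (id, length) pairs
        buckets[bucket_of(prompt_len)].append(req_id)
    batches = []
    for bucket_ids in buckets.values():
        for i in range(0, len(bucket_ids), max_batch):
            batches.append(bucket_ids[i:i + max_batch])
    return batches


# Example: short and long prompts end up in separate batches.
print(form_batches([(0, 90), (1, 100), (2, 1500), (3, 80), (4, 1800)], max_batch=2))
```

In a real serving system this grouping would feed a scheduler that also accounts for SLOs and memory headroom; the sketch only shows why length-aware batching reduces padding waste.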

Sources

Leveraging Multi-Instance GPUs through moldable task scheduling

Efficient Routing of Inference Requests across LLM Instances in Cloud-Edge Computing

Reducing GPU Memory Fragmentation via Spatio-Temporal Planning for Efficient Large-Scale Model Training

Collaborative Inference and Learning between Edge SLMs and Cloud LLMs: A Survey of Algorithms, Execution, and Open Challenges

WGRAMMAR: Leverage Prior Knowledge to Accelerate Structured Decoding

BucketServe: Bucket-Based Dynamic Batching for Smart and Efficient LLM Inference Serving

BrownoutServe: SLO-Aware Inference Serving under Bursty Workloads for MoE-based LLMs

Dynamic-DINO: Fine-Grained Mixture of Experts Tuning for Real-time Open-Vocabulary Object Detection

Towards Greater Leverage: Scaling Laws for Efficient Mixture-of-Experts Language Models

PolyServe: Efficient Multi-SLO Serving at Scale

Sandwich: Separating Prefill-Decode Compilation for Efficient CPU LLM Serving
