The field of large language models (LLMs) is increasingly focused on optimizing inference and training techniques to improve efficiency and scalability. Researchers are exploring novel algorithms and architectures that reduce latency, cost, and memory fragmentation. Notable advances include dynamic batching, sparse modeling, and collaborative inference between edge and cloud devices. Together, these innovations make LLM deployments more efficient and scalable, which is crucial for real-time applications and edge computing.
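To make the dynamic batching idea concrete, here is a minimal sketch of a continuous-batching scheduler: new requests join the in-flight batch at every decode step instead of waiting for the whole batch to drain. The names (Request, DynamicBatcher, max_batch_size) and the placeholder decode step are illustrative assumptions, not the API of any system discussed in this section.

```python
# Minimal sketch of dynamic (continuous) batching; names and the dummy
# decode step are illustrative assumptions, not any real serving API.
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    prompt: str
    max_new_tokens: int
    generated: list = field(default_factory=list)

    def finished(self) -> bool:
        return len(self.generated) >= self.max_new_tokens


class DynamicBatcher:
    """Admit waiting requests into the running batch at every decode step,
    rather than waiting for an entire static batch to finish."""

    def __init__(self, max_batch_size: int = 8):
        self.max_batch_size = max_batch_size
        self.waiting = deque()   # requests not yet scheduled
        self.running = []        # requests currently being decoded

    def submit(self, req: Request) -> None:
        self.waiting.append(req)

    def step(self) -> None:
        # 1. Evict finished requests so their slots are reusable immediately.
        self.running = [r for r in self.running if not r.finished()]
        # 2. Top the batch up from the waiting queue.
        while self.waiting and len(self.running) < self.max_batch_size:
            self.running.append(self.waiting.popleft())
        # 3. One decode step for the whole batch; a placeholder token stands
        #    in for a real model forward pass.
        for r in self.running:
            r.generated.append("<tok>")


# Usage: interleave submissions with decode steps.
batcher = DynamicBatcher(max_batch_size=2)
reqs = [Request("short prompt", 2), Request("longer prompt", 4), Request("late arrival", 3)]
for r in reqs:
    batcher.submit(r)
for _ in range(4):
    batcher.step()
print([len(r.generated) for r in reqs])  # [2, 4, 2]: the third request joined once the first finished
```

The point of the sketch is in steps 1 and 2: slots freed by finished requests are reused immediately, which is what keeps utilization high under bursty, variable-length workloads.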
Some noteworthy papers in this area include:

- Leveraging Multi-Instance GPUs through moldable task scheduling, which proposes a 3-phase algorithm to optimize task execution on Multi-Instance GPUs.
- WGRAMMAR, a lightweight decoding engine that leverages prior knowledge of output structure to achieve up to a 250x speedup over existing systems.
- BucketServe, a bucket-based dynamic batching framework that optimizes LLM inference performance and achieves up to a 3.58x throughput improvement (see the sketch after this list).
- BrownoutServe, a serving framework for MoE-based LLMs that adapts to dynamic workloads and optimizes inference efficiency, achieving up to a 2.07x throughput improvement.
- PolyServe, a multi-SLO scheduling policy that maintains high SLO attainment while maximizing throughput, achieving a 1.23x goodput gain over existing policies.
- Sandwich, a hardware-centric, CPU-based LLM serving engine that optimizes the prefill and decode phases separately, achieving an average 2.01x throughput improvement while delivering satisfactory time-to-first-token and time-per-output-token latencies 90% of the time.
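As a companion to the BucketServe entry above, the following is a minimal sketch of bucket-based batching: pending requests are grouped by padded sequence length so each batch carries little padding overhead. The bucket boundaries, the token budget, and the greedy packing are assumptions made for illustration, not the algorithm from the paper.

```python
# Minimal sketch of bucket-based batching in the spirit of BucketServe:
# group pending requests by padded sequence length so that batches waste
# little padding and memory fragmentation stays low. Bucket boundaries,
# the token budget, and the greedy packing are illustrative assumptions.
from collections import defaultdict


def bucket_for(length: int, boundaries=(128, 256, 512, 1024, 2048)) -> int:
    """Return the smallest bucket boundary that fits the request length."""
    for b in boundaries:
        if length <= b:
            return b
    return boundaries[-1]  # overly long requests fall into the largest bucket


def form_batches(requests, max_tokens_per_batch: int = 8192):
    """Greedily pack same-bucket requests into batches whose padded token
    count stays under a budget (a stand-in for a GPU memory constraint)."""
    buckets = defaultdict(list)
    for req_id, length in requests:
        buckets[bucket_for(length)].append((req_id, length))

    batches = []
    for padded_len, reqs in sorted(buckets.items()):
        capacity = max(1, max_tokens_per_batch // padded_len)
        for i in range(0, len(reqs), capacity):
            batches.append((padded_len, reqs[i:i + capacity]))
    return batches


# Usage: (request_id, prompt_length) pairs with very different lengths.
pending = [("a", 90), ("b", 120), ("c", 500), ("d", 1800), ("e", 130)]
for padded_len, batch in form_batches(pending):
    print(f"padded to {padded_len}: {[rid for rid, _ in batch]}")
```

In a real serving engine the token budget would typically be derived from available KV-cache memory rather than a fixed constant, but the grouping step shown here is the core of the bucket-based idea.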