Advancements in Efficient Large Language Model Inference

The field of large language models (LLMs) is increasingly focused on efficiency and scalability, particularly in resource-constrained environments. Researchers are exploring methods to reduce memory usage, increase parallelism, and improve end-to-end performance. Notably, advances in speculative decoding, distributed inference, and mixture-of-experts (MoE) architectures are paving the way for faster and more cost-effective deployment of LLMs.
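To make the speculative decoding idea concrete, the toy sketch below shows the basic propose-then-verify loop: a cheap draft model guesses several tokens ahead, and a larger target model accepts the longest agreeing prefix. The `draft_next_token` and `target_next_token` functions are placeholder toy models invented for illustration; this is a minimal sketch of the general technique, not the implementation from any of the papers listed here.

```python
# Minimal sketch of greedy speculative decoding. The two "models" below are
# deterministic toy functions standing in for a small draft LLM and a large
# target LLM; real systems verify all drafted tokens in one batched forward pass.

def draft_next_token(prefix):
    # Hypothetical cheap draft model.
    return (sum(prefix) + 1) % 50 if prefix else 1

def target_next_token(prefix):
    # Hypothetical expensive target model: agrees with the draft most of the time.
    t = (sum(prefix) + 1) % 50 if prefix else 1
    return t if len(prefix) % 4 else (t + 1) % 50  # occasional disagreement

def speculative_step(prefix, k=4):
    """Propose k draft tokens, then keep the longest prefix the target agrees with."""
    draft, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next_token(ctx)
        draft.append(tok)
        ctx.append(tok)

    accepted, ctx = [], list(prefix)
    for tok in draft:
        if target_next_token(ctx) == tok:
            accepted.append(tok)
            ctx.append(tok)
        else:
            # First mismatch: take the target's own token and stop this round.
            accepted.append(target_next_token(ctx))
            break
    else:
        # All k drafts accepted: the target contributes one bonus token.
        accepted.append(target_next_token(ctx))
    return accepted

if __name__ == "__main__":
    seq = [1]
    for _ in range(5):
        seq.extend(speculative_step(seq, k=4))
    print(seq)
```

The efficiency gain comes from the accepted prefix: several tokens are produced per call to the expensive target model, at the cost of holding the draft model (and its speculation buffers) in memory, which is exactly the budget that memory-aware engines such as SpecMemo aim to control.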

Several papers have made significant contributions to this area, including the development of device-aware inference engines, distributed inference platforms, and novel scheduling strategies.

Some particularly noteworthy papers include:

SpecMemo, which retains 96% of the overall throughput of speculative decoding on MT-Bench while reducing generation memory by 65% on a single Nvidia Titan RTX.

DistMLIP, which enables efficient MLIP parallelization through graph partitioning, allowing multi-device inference over flexible MLIP model architectures.

APEX, a novel scheduling strategy that maximizes CPU-GPU parallelism during hybrid LLM inference, improving throughput by 84%-96% on NVIDIA T4 and 11%-89% on A10 GPUs (a toy sketch of this overlap idea appears below).

FlashDMoE, a fully GPU-resident MoE operator that fuses expert computation and inter-GPU communication into a single persistent GPU kernel, achieving up to 6x lower latency and 5.7x higher throughput compared to state-of-the-art baselines.
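The sketch below illustrates the scheduling intuition behind hybrid CPU-GPU inference: rather than running the CPU-resident portion of a step (for example, attention over an offloaded KV cache) and the GPU-resident portion back to back, the two are launched concurrently and joined once per step. Both partitions are simulated here with NumPy matmuls on the host; the function names and workloads are assumptions for illustration, not the APEX scheduler itself.

```python
# Illustrative sketch of overlapping CPU-side and GPU-side work per inference step.
# Workloads are simulated with NumPy; this is not the APEX implementation.
import time
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def cpu_partition(x):
    # Stand-in for CPU-side work (e.g. attention over an offloaded KV cache).
    return x @ x.T

def gpu_partition(x):
    # Stand-in for GPU-side work (here just another matmul on the host).
    return x.T @ x

def serial_step(x):
    # Baseline: the two partitions run one after the other.
    return cpu_partition(x), gpu_partition(x)

def parallel_step(x, pool):
    # Overlapped schedule: NumPy matmuls release the GIL, so the two
    # partitions genuinely run in parallel on a multi-core host.
    f_cpu = pool.submit(cpu_partition, x)
    f_gpu = pool.submit(gpu_partition, x)
    return f_cpu.result(), f_gpu.result()

if __name__ == "__main__":
    x = np.random.rand(2048, 2048).astype(np.float32)
    with ThreadPoolExecutor(max_workers=2) as pool:
        t0 = time.perf_counter(); serial_step(x); t1 = time.perf_counter()
        parallel_step(x, pool); t2 = time.perf_counter()
    print(f"serial:   {t1 - t0:.3f}s")
    print(f"parallel: {t2 - t1:.3f}s")
```

The attainable speedup depends on how evenly the work splits across the two devices, which is why scheduling (deciding what to place on the CPU versus the GPU each step) is the central problem these systems address.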

Sources

SpecMemo: Speculative Decoding is in Your Pocket

DistMLIP: A Distributed Inference Platform for Machine Learning Interatomic Potentials

Evaluating the Efficacy of LLM-Based Reasoning for Multiobjective HPC Job Scheduling

DiOMP-Offloading: Toward Portable Distributed Heterogeneous OpenMP

Scaling Fine-Grained MoE Beyond 50B Parameters: Empirical Evaluation and Practical Insights

Memory-Efficient and Privacy-Preserving Collaborative Training for Mixture-of-Experts LLMs

Parallel CPU-GPU Execution for LLM Inference on Constrained GPUs

WANDER: An Explainable Decision-Support Framework for HPC

FlashDMoE: Fast Distributed MoE in a Single Kernel
