The field of large language models (LLMs) is moving toward greater efficiency and scalability, particularly in resource-constrained environments. Researchers are exploring methods that reduce memory usage, increase parallelism, and improve inference throughput. Notably, advances in speculative decoding (sketched below), distributed inference, and mixture-of-experts (MoE) architectures are paving the way for faster and more cost-effective LLM deployment.
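To make the speculative decoding idea concrete, the following minimal Python sketch shows the basic draft-then-verify loop: a cheap draft model proposes a short block of tokens, and the large target model checks them in a single pass, committing the accepted prefix so several tokens can be emitted per expensive call. This is a generic illustration only; the functions draft_propose and target_verify are hypothetical stand-ins (simulated with random choices), not any specific paper's implementation.

```python
import random

def draft_propose(context, k):
    # Hypothetical cheap draft model: propose k candidate next tokens.
    return [random.randint(0, 99) for _ in range(k)]

def target_verify(context, proposals):
    # Hypothetical expensive target model: checks all k proposals in one
    # forward pass. Returns how many it accepts and, if it rejects one,
    # its own replacement token (acceptance simulated here at random).
    for i, _tok in enumerate(proposals):
        if random.random() > 0.7:              # reject this proposal
            return i, random.randint(0, 99)
    return len(proposals), None                # all proposals accepted

def speculative_decode(context, max_new_tokens=32, k=4):
    output = list(context)
    while len(output) - len(context) < max_new_tokens:
        proposals = draft_propose(output, k)
        n_accepted, correction = target_verify(output, proposals)
        output.extend(proposals[:n_accepted])  # keep the accepted prefix
        if correction is not None:
            output.append(correction)          # target's replacement token
    return output[len(context):len(context) + max_new_tokens]

print(speculative_decode([1, 2, 3]))
```

Because the target model only verifies rather than generates token by token, wall-clock latency drops whenever the draft model's proposals are frequently accepted.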
Several papers have made significant contributions to this area, including the development of device-aware inference engines, distributed inference platforms, and novel scheduling strategies.
Some particularly noteworthy papers include: SpecMemo, which retains 96% of overall speculative-decoding throughput on MT-Bench while reducing generation memory by 65% on a single Nvidia Titan RTX; DistMLIP, which enables efficient MLIP parallelization through graph partitioning, allowing multi-device inference across flexible MLIP model architectures; APEX, a novel scheduling strategy that maximizes CPU-GPU parallelism during hybrid LLM inference, improving throughput by 84%-96% on NVIDIA T4 GPUs and 11%-89% on A10 GPUs; and FlashDMoE, a fully GPU-resident MoE operator that fuses expert computation and inter-GPU communication into a single persistent GPU kernel, achieving up to 6x lower latency and 5.7x higher throughput than state-of-the-art baselines.
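For readers unfamiliar with MoE layers, the sketch below shows standard top-k expert routing in NumPy: a gating network scores each token against every expert, each token is dispatched to its top-k experts, and the expert outputs are combined with normalized gate weights. This is a generic illustration under assumed toy dimensions, not FlashDMoE's method, whose contribution is fusing this dispatch, the expert computation, and inter-GPU communication into one persistent kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, d_model, num_experts, top_k = 8, 16, 4, 2

tokens = rng.normal(size=(num_tokens, d_model))
gate_w = rng.normal(size=(d_model, num_experts))                 # gating weights
experts = [rng.normal(size=(d_model, d_model)) for _ in range(num_experts)]

logits = tokens @ gate_w                                          # (tokens, experts)
topk_idx = np.argsort(logits, axis=1)[:, -top_k:]                 # chosen experts per token
topk_gate = np.take_along_axis(logits, topk_idx, axis=1)
topk_gate = np.exp(topk_gate) / np.exp(topk_gate).sum(axis=1, keepdims=True)

# Combine each token's chosen expert outputs, weighted by its gate scores.
output = np.zeros_like(tokens)
for t in range(num_tokens):
    for slot in range(top_k):
        e = topk_idx[t, slot]
        output[t] += topk_gate[t, slot] * (tokens[t] @ experts[e])

print(output.shape)  # (8, 16): each token mixes its two selected experts
```

In a real deployment the experts live on different GPUs, so the per-token dispatch and gather steps become communication operations; reducing that overhead is exactly where kernel-fusion approaches such as FlashDMoE aim their gains.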