The field of deep learning inference is moving toward optimizing performance on both mobile devices and large-scale systems. Researchers are accelerating inference by leveraging the complementary strengths of CPUs and GPUs, and by developing new architectures and techniques that reduce latency and improve efficiency. There is also growing interest in mixture-of-experts (MoE) models, which can alleviate memory bottlenecks and improve the performance of large language models. MoE models introduce new challenges of their own, however, such as expert parallelism and communication overhead, which are being addressed through solutions like co-designed mapping and adaptive expert scheduling. Overall, the field is advancing rapidly, with a focus on practical, efficient solutions for real-world applications.

Noteworthy papers in this area include:

Accelerating Mobile Inference through Fine-Grained CPU-GPU Co-Execution, which proposes a lightweight synchronization mechanism and machine learning models that predict per-device execution times, achieving up to 1.89x speedup on mobile platforms.

ExpertFlow: Adaptive Expert Scheduling and Memory Coordination for Efficient MoE Inference, which presents a runtime system combining adaptive expert prefetching with cache-aware routing, reducing model stall time to less than 0.1% of the baseline.
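To make the co-execution idea concrete, the sketch below balances a layer's work between CPU and GPU using predicted per-device latencies, finishing when the slower device finishes. The linear cost models and the names predict_cpu_time, predict_gpu_time, and best_split are illustrative assumptions, not the paper's actual learned predictors or scheduler.

```python
# Minimal sketch of latency-balanced CPU/GPU work splitting.
# The predictor functions are hypothetical stand-ins for learned
# execution-time models; they are not the authors' models.

def predict_cpu_time(rows: int) -> float:
    """Hypothetical CPU latency model (ms) for `rows` units of work."""
    return 0.05 * rows + 0.2          # assumed linear cost + small fixed overhead

def predict_gpu_time(rows: int) -> float:
    """Hypothetical GPU latency model (ms) for `rows` units of work."""
    return 0.01 * rows + 1.5          # assumed cheaper per row, higher launch cost

def best_split(total_rows: int) -> tuple[int, float]:
    """Pick the CPU share that minimizes the slower of the two devices,
    since co-execution completes only when the last device finishes."""
    best_rows, best_latency = 0, float("inf")
    for cpu_rows in range(total_rows + 1):
        latency = max(predict_cpu_time(cpu_rows),
                      predict_gpu_time(total_rows - cpu_rows))
        if latency < best_latency:
            best_rows, best_latency = cpu_rows, latency
    return best_rows, best_latency

if __name__ == "__main__":
    cpu_rows, latency = best_split(1024)
    print(f"CPU: {cpu_rows} rows, GPU: {1024 - cpu_rows} rows, "
          f"predicted latency {latency:.2f} ms")
```

In practice the predictors would be trained per operator and per device, and the search could be replaced by a closed-form balance point; the brute-force loop here is only for clarity.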
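Similarly, the sketch below illustrates the combination of adaptive expert prefetching with cache-aware routing in the spirit of ExpertFlow. The LRU policy, the score tolerance, and every name here (ExpertCache, prefetch, cache_aware_route) are assumptions made for illustration, not ExpertFlow's actual policy or API.

```python
# Minimal sketch of expert prefetching plus cache-aware routing.
# Cache size, eviction policy, and routing tolerance are illustrative assumptions.
from collections import OrderedDict

class ExpertCache:
    """LRU cache of expert weights resident in fast memory (e.g. GPU HBM)."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.resident = OrderedDict()          # expert_id -> placeholder weights

    def contains(self, expert_id: int) -> bool:
        return expert_id in self.resident

    def fetch(self, expert_id: int) -> None:
        """Load an expert, evicting the least recently used one if full."""
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)
            return
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)
        self.resident[expert_id] = f"weights_{expert_id}"  # stand-in for real weights

def prefetch(cache: ExpertCache, predicted_next: list[int]) -> None:
    """Adaptive prefetching: warm the cache with experts predicted for the
    next layer (the prediction source, e.g. gating history, is assumed)."""
    for expert_id in predicted_next:
        cache.fetch(expert_id)

def cache_aware_route(scores: dict[int, float], cache: ExpertCache,
                      tolerance: float = 0.05) -> int:
    """Prefer a cached expert whose router score is within `tolerance` of the
    top score, so a cache hit avoids stalling on a weight transfer."""
    top_id = max(scores, key=scores.get)
    for expert_id, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        if cache.contains(expert_id) and scores[top_id] - score <= tolerance:
            return expert_id
    cache.fetch(top_id)                        # cache miss: fetch the true top expert
    return top_id

if __name__ == "__main__":
    cache = ExpertCache(capacity=2)
    prefetch(cache, predicted_next=[3])
    chosen = cache_aware_route({1: 0.42, 3: 0.40, 7: 0.18}, cache)
    print("routed to expert", chosen)          # expert 3: nearly top-scoring and cached
```

The key design point the sketch tries to capture is that routing and memory management cooperate: the router tolerates a small score gap in exchange for avoiding a weight transfer, while prefetching hides the remaining transfers behind computation.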