The field of Mixture-of-Experts (MoE) models is advancing rapidly, with a focus on improving efficiency, scalability, and performance. Recent work has centered on optimizing MoE architectures for deployment on edge devices, reducing memory access costs, and improving expert activation prediction. Notably, researchers are exploring methods for dynamic expert scheduling, importance-driven expert offloading, and hybrid adaptive parallelism to boost inference efficiency. Further studies have investigated the optimal sparsity of MoE models for reasoning tasks and the benefits of integrating memory layers into transformer blocks. These innovations have significant implications for the development of large language models and their applications across domains.
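Many of these directions build on the same primitive: a sparse router selects a small subset of experts per token, and systems techniques (scheduling, offloading, prefetching) try to anticipate or exploit that selection. The PyTorch sketch below shows generic top-k gating plus a toy activation predictor of the kind an offloading scheme might use to decide which expert weights to prefetch; the class names, sizes, and prefetch logic are illustrative assumptions, not the methods of any paper cited here.

```python
# Minimal sketch (illustrative, not from the cited papers): top-k expert routing
# plus a lightweight activation predictor that could drive expert prefetching.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    """Standard sparse MoE gating: each token is sent to its top-k experts."""
    def __init__(self, d_model: int, n_experts: int, k: int = 2):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.k = k

    def forward(self, x):                       # x: [tokens, d_model]
        logits = self.gate(x)                   # [tokens, n_experts]
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)
        weights = F.softmax(topk_vals, dim=-1)  # renormalize over selected experts
        return weights, topk_idx                # which experts each token activates

class ActivationPredictor(nn.Module):
    """Tiny proxy that guesses which experts an upcoming layer is likely to need,
    so their weights can be prefetched from host memory before they are used."""
    def __init__(self, d_model: int, n_experts: int):
        super().__init__()
        self.proj = nn.Linear(d_model, n_experts)

    def forward(self, hidden):                  # hidden: [tokens, d_model]
        probs = torch.sigmoid(self.proj(hidden))
        return probs.mean(dim=0)                # per-expert activation likelihood

if __name__ == "__main__":
    d_model, n_experts = 64, 8
    x = torch.randn(16, d_model)                # 16 tokens
    weights, idx = TopKRouter(d_model, n_experts)(x)
    likely = ActivationPredictor(d_model, n_experts)(x).topk(4).indices
    print("routed experts per token:", idx.shape)    # torch.Size([16, 2])
    print("experts to prefetch:", likely.tolist())
```

In a real offloading pipeline, the predictor's output would be used asynchronously to stage the likely experts onto the accelerator while earlier layers execute.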
Noteworthy papers include GPT-OSS-20B, which demonstrates the deployment-centric advantages of MoE models, and MoE-Beyond, which introduces a learning-based expert activation predictor. UltraMemV2 is also notable for its redesigned memory-layer architecture, which reaches performance parity with state-of-the-art MoE models. Finally, HAP presents a hybrid adaptive parallelism method for efficient MoE inference that consistently matches or outperforms mainstream inference systems.
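Memory layers of the kind UltraMemV2 revisits replace the dense feed-forward block with a much larger, sparsely accessed table of learned key-value slots, trading parameter count for low per-token compute and memory traffic. The sketch below illustrates only the generic idea (flat keys with top-k retrieval); it is an assumed, simplified form and does not reproduce the UltraMemV2 architecture, and all names and dimensions are illustrative.

```python
# Minimal sketch (assumed, not the UltraMemV2 design): a key-value memory layer
# used in place of a dense FFN, with sparse top-k retrieval over learned slots.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryLayer(nn.Module):
    def __init__(self, d_model: int, n_slots: int = 4096, k: int = 32):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(n_slots, d_model) / d_model ** 0.5)
        self.values = nn.Parameter(torch.randn(n_slots, d_model) / d_model ** 0.5)
        self.k = k

    def forward(self, x):                           # x: [tokens, d_model]
        scores = x @ self.keys.t()                  # [tokens, n_slots]
        top_scores, top_idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(top_scores, dim=-1)     # sparse attention over k slots
        gathered = self.values[top_idx]             # [tokens, k, d_model]
        return (weights.unsqueeze(-1) * gathered).sum(dim=1)

if __name__ == "__main__":
    layer = MemoryLayer(d_model=64)
    out = layer(torch.randn(8, 64))
    print(out.shape)                                # torch.Size([8, 64])
```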