Research on Mixture-of-Experts (MoE) models is moving toward more efficient and scalable architectures. Recent work targets three recurring challenges: load imbalance across experts, parameter redundancy, and communication overhead. Proposed remedies include dynamic expert clustering, structured compression, and elastic scaling methods, which have yielded gains in efficiency, throughput, and accuracy. Some papers introduce routing algorithms and architectures that balance expert load while preserving accuracy; others examine the theoretical foundations of MoE models and their optimization landscapes.
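To make the load-balancing problem concrete, here is a minimal, generic sketch of top-k token routing with a Switch-Transformer-style auxiliary balance loss. It is not tied to any of the papers summarized below; the class and variable names (e.g. TopKRouter) are ours, chosen for illustration only.

```python
# Minimal sketch of top-k token routing with an auxiliary load-balancing loss.
# Illustrative only: names are ours, not taken from any of the cited papers.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKRouter(nn.Module):
    """Routes each token to its top-k experts and reports a balance loss."""

    def __init__(self, d_model: int, num_experts: int, k: int = 2):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        self.num_experts = num_experts
        self.k = k

    def forward(self, x: torch.Tensor):
        # x: (num_tokens, d_model)
        logits = self.gate(x)                              # (num_tokens, num_experts)
        probs = F.softmax(logits, dim=-1)
        topk_probs, topk_idx = probs.topk(self.k, dim=-1)  # (num_tokens, k)

        # Fraction of tokens dispatched to each expert (hard top-k assignment).
        dispatch = F.one_hot(topk_idx, self.num_experts).float().sum(dim=1)
        tokens_per_expert = dispatch.mean(dim=0) / self.k  # fractions sum to 1
        # Mean router probability mass per expert (soft assignment).
        prob_per_expert = probs.mean(dim=0)

        # Auxiliary loss is minimized when both distributions are uniform,
        # i.e. every expert receives roughly 1 / num_experts of the load.
        aux_loss = self.num_experts * torch.sum(tokens_per_expert * prob_per_expert)
        return topk_idx, topk_probs, aux_loss


if __name__ == "__main__":
    router = TopKRouter(d_model=64, num_experts=8, k=2)
    tokens = torch.randn(32, 64)
    idx, weights, aux = router(tokens)
    print(idx.shape, weights.shape, aux.item())  # (32, 2), (32, 2), scalar
```

In practice the auxiliary loss is added to the task loss with a small coefficient, so the gate learns to spread tokens across experts without sacrificing accuracy.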
Noteworthy papers in this area include Breaking the MoE LLM Trilemma, which introduces a unified framework for dynamic expert clustering and structured compression; ElasticMoE, which achieves fine-grained, low-latency scaling of MoE models; TokenFlow, which improves text-streaming performance through preemptive request scheduling and proactive key-value cache management; SliceMoE, which routes embedding slices rather than whole tokens for fine-grained, balanced transformer scaling (a rough sketch of this general idea follows below); and Guided by the Experts, which provides convergence guarantees for joint training of soft-routed MoE models.
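The following is a speculative sketch of slice-level routing in the general sense suggested by SliceMoE's one-line description: the hidden vector is split into equal slices and each slice is routed to an expert independently, so load can balance at sub-token granularity. It is our illustration of that idea under stated assumptions, not SliceMoE's actual algorithm; all names (SliceRouter, num_slices, etc.) are hypothetical.

```python
# Speculative sketch of slice-level routing: each token's hidden vector is cut
# into equal slices, and each slice is routed to its own expert. This is only
# our reading of "routes embedding slices instead of tokens", not the method
# described in the SliceMoE paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SliceRouter(nn.Module):
    def __init__(self, d_model: int, num_slices: int, num_experts: int):
        super().__init__()
        assert d_model % num_slices == 0
        self.slice_dim = d_model // num_slices
        self.num_slices = num_slices
        # One shared gate over slices; each expert is a small MLP on a slice.
        self.gate = nn.Linear(self.slice_dim, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(self.slice_dim, 4 * self.slice_dim),
                nn.GELU(),
                nn.Linear(4 * self.slice_dim, self.slice_dim),
            )
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model) -> slices: (num_tokens * num_slices, slice_dim)
        slices = x.view(-1, self.num_slices, self.slice_dim).reshape(-1, self.slice_dim)
        probs = F.softmax(self.gate(slices), dim=-1)
        top1 = probs.argmax(dim=-1)                 # top-1 expert per slice
        out = torch.zeros_like(slices)
        for e, expert in enumerate(self.experts):   # loop for clarity, not speed
            mask = top1 == e
            if mask.any():
                # Scale by the gate probability so routing stays differentiable.
                out[mask] = expert(slices[mask]) * probs[mask, e].unsqueeze(-1)
        return out.view(x.shape)                    # reassemble slices per token
```

Because the routing unit is a slice rather than a whole token, each token spreads its computation over several experts, which is one plausible way to obtain the finer-grained load balance the summary alludes to.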