Efficient Scaling and Optimization of Mixture-of-Experts Models

Research on Mixture-of-Experts (MoE) models is moving toward more efficient and scalable architectures. Recent work targets the core challenges of load imbalance, parameter redundancy, and communication overhead, proposing solutions such as dynamic expert clustering, structured compression, and elastic scaling. These advances translate into measurable gains in model efficiency, serving throughput, and accuracy. Notably, several papers introduce routing algorithms and architectures that balance expert load while preserving accuracy, and others examine the theoretical foundations of MoE models and their optimization landscapes.
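
To make the load-balancing theme concrete, the sketch below shows a generic top-1 MoE router with a Switch-Transformer-style auxiliary balance loss. The function and tensor names are illustrative assumptions and are not taken from any of the papers listed under Sources.

```python
import torch
import torch.nn.functional as F

def top1_route(hidden, router_weight, num_experts):
    """Generic top-1 MoE routing with a load-balancing auxiliary loss
    (Switch-Transformer style). Purely illustrative; not reproduced from
    any of the surveyed papers."""
    # hidden: (tokens, d_model); router_weight: (d_model, num_experts)
    logits = hidden @ router_weight
    probs = F.softmax(logits, dim=-1)        # router probabilities per token
    expert_idx = probs.argmax(dim=-1)        # chosen expert per token

    # f: fraction of tokens dispatched to each expert
    # p: mean router probability assigned to each expert
    f = F.one_hot(expert_idx, num_experts).float().mean(dim=0)
    p = probs.mean(dim=0)
    # The product f * p is minimized when both distributions are uniform,
    # discouraging the router from overloading a few experts.
    aux_loss = num_experts * torch.sum(f * p)
    return expert_idx, probs, aux_loss
```

In training, the auxiliary loss is typically added to the task loss with a small coefficient, so the router keeps expert load roughly even without overriding the main objective.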

Several noteworthy papers stand out in this area. Breaking the MoE LLM Trilemma introduces a unified framework for dynamic expert clustering and structured compression. ElasticMoE achieves fine-grained, low-latency scaling for MoE models. TokenFlow improves text-streaming responsiveness through preemptive request scheduling and proactive key-value cache management. SliceMoE routes embedding slices instead of tokens for fine-grained, balanced transformer scaling (see the sketch after this paragraph). Guided by the Experts provides convergence guarantees for the joint training of soft-routed MoE models.
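
As a rough illustration of the slice-level routing idea attributed to SliceMoE above, the sketch below splits each hidden vector into contiguous slices and routes every slice independently. The per-slice router weights, shapes, and function name are assumptions made for illustration, not the paper's actual design.

```python
import torch
import torch.nn.functional as F

def route_slices(hidden, slice_routers):
    """Hypothetical slice-level routing: split each token's hidden vector
    into contiguous slices and route each slice to an expert on its own.
    Illustrates the general idea only, not SliceMoE's algorithm."""
    tokens, d_model = hidden.shape
    num_slices, slice_dim, num_experts = slice_routers.shape  # (S, d_slice, E)
    assert d_model == num_slices * slice_dim
    slices = hidden.view(tokens, num_slices, slice_dim)       # (tokens, S, d_slice)

    # Per-slice routing logits: (tokens, S, num_experts)
    logits = torch.einsum("tsd,sde->tse", slices, slice_routers)
    probs = F.softmax(logits, dim=-1)
    expert_idx = probs.argmax(dim=-1)                         # (tokens, S)

    # Dispatching slices rather than whole tokens gives the load balancer a
    # finer-grained unit of work to spread across experts.
    return expert_idx, probs
```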

Sources

Breaking the MoE LLM Trilemma: Dynamic Expert Clustering with Structured Compression

ElasticMoE: An Efficient Auto Scaling Method for Mixture-of-Experts Models

TokenFlow: Responsive LLM Text Streaming Serving under Request Burst via Preemptive Scheduling

Mixture of Many Zero-Compute Experts: A High-Rate Quantization Theory Perspective

From Score Distributions to Balance: Plug-and-Play Mixture-of-Experts Routing

MACE: A Hybrid LLM Serving System with Colocated SLO-aware Continuous Retraining Alignment

SliceMoE: Routing Embedding Slices Instead of Tokens for Fine-Grained and Balanced Transformer Scaling

Guided by the Experts: Provable Feature Learning Dynamic of Soft-Routed Mixture-of-Experts
