Mixture-of-Experts Advancements

The field of Mixture-of-Experts (MoE) is seeing significant developments, driven by innovations in expert selection, routing policies, and model compression. Researchers are exploring new methods to improve the efficiency and effectiveness of MoE models, including hierarchical task-guided and context-responsive routing policies, as well as techniques for extracting expert subnetworks from pretrained dense networks. These advances are yielding better performance, lower computational cost, and broader applicability of MoE models. Noteworthy papers in this area include:

  • THOR-MoE, which introduces a hierarchical task-guided and context-responsive routing policy and achieves superior performance on multi-domain and multilingual machine translation benchmarks (a generic routing sketch follows this list).
  • Efficient Data Driven Mixture-of-Expert Extraction from Trained Networks, which proposes a method for constructing MoE variants from pretrained dense models, reducing computational cost while achieving competitive performance on ImageNet-1k recognition.
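
To make the routing idea concrete, below is a minimal sketch of a standard top-k token-routing MoE layer, assuming PyTorch. The class name `TopKMoE` and all dimensions are illustrative assumptions; it shows generic learned gating and dispatch, not the hierarchical THOR-MoE policy or the data-driven extraction procedure from the papers above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoE(nn.Module):
    """Generic top-k routed mixture-of-experts feed-forward layer (illustrative sketch)."""

    def __init__(self, d_model: int, d_hidden: int, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        # Router: one logit per expert for every token.
        self.gate = nn.Linear(d_model, num_experts)
        # Experts: independent two-layer feed-forward networks.
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
             for _ in range(num_experts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        logits = self.gate(x)                            # (tokens, experts)
        weights, indices = logits.topk(self.k, dim=-1)   # keep the k best experts per token
        weights = F.softmax(weights, dim=-1)             # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e             # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out


# Example: 16 tokens of width 64 routed through 8 experts, 2 active per token.
layer = TopKMoE(d_model=64, d_hidden=256, num_experts=8, k=2)
tokens = torch.randn(16, 64)
print(layer(tokens).shape)  # torch.Size([16, 64])
```

In practice the per-expert Python loop is replaced by batched dispatch, and an auxiliary load-balancing term is typically added to the training loss so tokens do not collapse onto a few experts; the works above build on this basic pattern with richer routing signals or by deriving the experts from an already-trained network.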

Sources

On DeepSeekMoE: Statistical Benefits of Shared Experts and Normalized Sigmoid Gating

THOR-MoE: Hierarchical Task-Guided and Context-Responsive Routing for Neural Machine Translation

Efficient Data Driven Mixture-of-Expert Extraction from Trained Networks

Not All Models Suit Expert Offloading: On Local Routing Consistency of Mixture-of-Expert Models
