Advances in Mixture-of-Experts Models

Research on Mixture-of-Experts (MoE) models is moving quickly, with a shared focus on scalability, efficiency, and performance. Recent work centers on new training frameworks, routing mechanisms, and scaling laws for sparse expert architectures. One notable thread is elastic inference-time expert utilization, which lets a trained model adjust the number of active experts to the available computational budget. There is also growing interest in the internal mechanisms of MoE models, including expert-level behaviors and routing dynamics. Representative papers include Elastic MoE, a training framework aimed at inference-time scalability, and Dynamic Experts Search, a test-time scaling strategy for improving reasoning in MoE LLMs. Other contributions, such as Towards a Comprehensive Scaling Law of Mixture-of-Experts and Bayesian Mixture-of-Experts, extend our understanding of how these models scale and how they can express uncertainty.
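
To make the idea of elastic inference-time expert utilization concrete, the following is a minimal sketch of a sparse MoE layer with top-k routing in which the number of active experts per token can be changed at inference time. It is an illustrative example only; the class and parameter names (TopKMoELayer, d_model, num_experts, k) are assumptions and do not reproduce the method of any paper listed below.

```python
# Minimal sketch of a sparse MoE layer with top-k routing.
# The per-token expert count k can be overridden at inference time,
# trading compute for quality ("elastic" expert utilization).
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoELayer(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, num_experts: int, k: int = 2):
        super().__init__()
        self.num_experts = num_experts
        self.k = k  # default number of experts activated per token
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            [
                nn.Sequential(
                    nn.Linear(d_model, d_hidden),
                    nn.GELU(),
                    nn.Linear(d_hidden, d_model),
                )
                for _ in range(num_experts)
            ]
        )

    def forward(self, x: torch.Tensor, k: int | None = None) -> torch.Tensor:
        # x: (num_tokens, d_model); k may be overridden per call.
        k = self.k if k is None else k
        logits = self.router(x)                         # (tokens, experts)
        topk_logits, topk_idx = logits.topk(k, dim=-1)  # (tokens, k)
        # Renormalize gate weights over the k selected experts.
        weights = F.softmax(topk_logits, dim=-1)

        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            # Tokens that routed to expert e, and the slot they chose it in.
            token_idx, slot = (topk_idx == e).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue
            expert_out = expert(x[token_idx])
            out[token_idx] += weights[token_idx, slot].unsqueeze(-1) * expert_out
        return out


# Usage: the same layer can be run under different compute budgets.
layer = TopKMoELayer(d_model=64, d_hidden=256, num_experts=8, k=2)
tokens = torch.randn(10, 64)
y_cheap = layer(tokens, k=1)  # fewer active experts, cheaper inference
y_full = layer(tokens, k=4)   # more active experts, larger budget
```

Note that simply changing k at inference time on a conventionally trained MoE can degrade quality, which is precisely the gap that elastic and Matryoshka-style training schemes in the papers below are designed to close.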

Sources

Elastic MoE: Unlocking the Inference-Time Scalability of Mixture-of-Experts

Dynamic Experts Search: Enhancing Reasoning in Mixture-of-Experts LLMs at Test Time

Towards a Comprehensive Scaling Law of Mixture-of-Experts

Bayesian Mixture-of-Experts: Towards Making LLMs Know What They Don't Know

Beyond Benchmarks: Understanding Mixture-of-Experts Models through Internal Mechanisms

LLaDA-MoE: A Sparse MoE Diffusion Language Model

GRACE-MoE: Grouping and Replication with Locality-Aware Routing for Efficient Distributed MoE Inference

LD-MoLE: Learnable Dynamic Routing for Mixture of LoRA Experts

Understanding the Mixture-of-Experts with Nadaraya-Watson Kernel

Training Matryoshka Mixture-of-Experts for Elastic Inference-Time Expert Utilization

FlowMoE: A Scalable Pipeline Scheduling Framework for Distributed Mixture-of-Experts Training

Spectral Scaling Laws in Language Models: How Effectively Do Feed-Forward Networks Use Their Latent Space?
