Advancements in Mixture-of-Experts Models

The field of Mixture-of-Experts (MoE) models is advancing rapidly, with a focus on improving efficiency, scalability, and performance. Recent work has centered on optimizing MoE architectures for deployment on edge devices, reducing memory access costs, and improving expert activation prediction. Researchers are exploring dynamic expert scheduling, importance-driven expert offloading, and hybrid adaptive parallelism to boost inference efficiency. Other studies investigate the optimal sparsity of MoE models for reasoning tasks and the benefits of integrating memory layers into transformer blocks. These directions shape how large language models are built and served across a range of domains.
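
To make the sparsity pattern concrete, the minimal sketch below shows top-k routing in a sparse MoE layer: a gating network scores all experts, but only the top-k run per token. The module layout, dimensions, and hyperparameters are illustrative assumptions, not details taken from any of the papers listed under Sources.

```python
# Minimal sketch of top-k routing in a sparse MoE layer (PyTorch).
# All names, sizes, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=1024, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)  # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):  # x: (num_tokens, d_model)
        logits = self.router(x)                             # (num_tokens, num_experts)
        gate, idx = torch.topk(logits, self.top_k, dim=-1)  # keep only the top-k experts
        gate = F.softmax(gate, dim=-1)                      # renormalize over the top-k
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                    # tokens routed to expert e
                if mask.any():
                    out[mask] += gate[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(16, 512)
print(TopKMoE()(tokens).shape)  # torch.Size([16, 512])
```

Because each token touches only k of the experts, total parameters can grow while per-token compute stays roughly constant, which is what makes questions of optimal sparsity and expert placement on constrained devices interesting in the first place.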

Noteworthy papers include GPT-OSS-20B, a deployment-centric analysis of OpenAI's open-weight MoE model; MoE-Beyond, which introduces a learning-based expert activation predictor for edge devices; UltraMemV2, whose redesigned memory-layer architecture reaches performance parity with state-of-the-art MoE models; and HAP, a hybrid adaptive parallelism method for efficient MoE inference that consistently matches or outperforms mainstream inference systems.
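
As a rough illustration of how activation prediction and importance-driven scheduling fit together on an edge device, the sketch below keeps only the highest-scoring experts resident in fast memory and asks the caller to move the rest between tiers. The scoring rule and cache policy are assumptions for exposition, not the method of MoE-Beyond or of any other paper below.

```python
# Illustrative sketch of importance-driven expert caching for edge inference.
# The importance score and eviction policy are assumptions, not a published method.
from collections import defaultdict

class ExpertCache:
    """Keep only the highest-importance experts resident in fast memory."""
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.importance = defaultdict(float)  # running activation score per expert
        self.resident = set()                 # experts currently held in fast memory

    def record(self, expert_id, gate_weight):
        # Exponential moving average of routing weight as an importance proxy.
        self.importance[expert_id] = 0.9 * self.importance[expert_id] + 0.1 * gate_weight

    def schedule(self, predicted_experts):
        # Combine experts predicted for the next step with historically important
        # ones, truncated to the fast-memory budget.
        ranked = sorted(self.importance, key=self.importance.get, reverse=True)
        wanted = list(dict.fromkeys(list(predicted_experts) + ranked))[:self.capacity]
        to_load = [e for e in wanted if e not in self.resident]
        to_evict = [e for e in self.resident if e not in wanted]
        self.resident = set(wanted)
        return to_load, to_evict              # caller moves weights between memory tiers

cache = ExpertCache(capacity=2)
for expert_id, weight in [(0, 0.6), (3, 0.3), (0, 0.7), (5, 0.9)]:
    cache.record(expert_id, weight)
print(cache.schedule(predicted_experts=[5, 1]))  # ([5, 1], [])
```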

Sources

GPT-OSS-20B: A Comprehensive Deployment-Centric Analysis of OpenAI's Open-Weight Mixture of Experts Model

MoE-Beyond: Learning-Based Expert Activation Prediction on Edge Devices

MoE-Inference-Bench: Performance Evaluation of Mixture of Expert Large Language and Vision Models

ExpertWeave: Efficiently Serving Expert-Specialized Fine-Tuned Adapters at Scale

TiKMiX: Take Data Influence into Dynamic Mixture for Language Model Pre-training

DualSparse-MoE: Coordinating Tensor/Neuron-Level Sparsity with Expert Partition and Reconstruction

Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks

UltraMemV2: Memory Networks Scaling to 120B Parameters with Superior Long-Context Learning

Enabling MoE on the Edge via Importance-Driven Expert Scheduling

When recalling in-context, Transformers are not SSMs

HAP: Hybrid Adaptive Parallelism for Efficient Mixture-of-Experts Inference

ExpertSim: Fast Particle Detector Simulation Using Mixture-of-Generative-Experts
