Advancements in Mixture-of-Experts Models

The field of Mixture-of-Experts (MoE) models is advancing rapidly, with a focus on improving efficiency, scalability, and performance. Recent research explores techniques for optimizing MoE models, including mixed-precision quantization, on-the-fly inference on memory-constrained hardware, and sensitivity-driven expert allocation. These innovations aim to address the challenges of deploying MoE models on resource-constrained devices and in multi-tenant serving environments. Recent work has also re-examined the importance of feedforward networks in transformer models and shown that finer-grained experts boost expressivity. In addition, new architectures and frameworks have been proposed: UMoE unifies attention and feedforward networks with shared experts, and PT-MoE integrates mixture-of-experts into prompt tuning.

Noteworthy papers include MxMoE, which introduces a mixed-precision quantization framework that co-designs for accuracy and performance in MoE models, and FloE, which proposes an on-the-fly MoE inference system for memory-constrained GPUs. LoRA-SMoE is also notable for its sensitivity-driven expert allocation method, which enables efficient fine-tuning of LoRA-MoE models. QoS-Efficient Serving of Multiple Mixture-of-Expert LLMs Using Partial Runtime Reconfiguration is another significant contribution, addressing the challenge of serving multiple fine-tuned MoE LLMs on a single GPU.
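For context, the sketch below shows the token-level top-k routed MoE feedforward block that the papers above optimize or extend. It is a minimal illustration only: the class name, dimensions, and the dense per-expert loop are choices made for this summary, not code from any of the cited works, which replace parts of this block with quantized weights, offloaded experts, or LoRA-style adapters.

```python
# Minimal top-k gated MoE feedforward block (illustrative sketch, PyTorch).
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoE(nn.Module):
    """Route each token to its top-k experts and mix their outputs."""

    def __init__(self, d_model: int, d_hidden: int, n_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_hidden),
                nn.GELU(),
                nn.Linear(d_hidden, d_model),
            )
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) -> flatten tokens for per-token routing.
        tokens = x.reshape(-1, x.shape[-1])
        logits = self.router(tokens)                    # (n_tokens, n_experts)
        weights, indices = logits.topk(self.k, dim=-1)  # keep k experts per token
        weights = F.softmax(weights, dim=-1)

        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            mask = indices == e                         # tokens routed to expert e
            if not mask.any():
                continue
            token_ids, slot = mask.nonzero(as_tuple=True)
            out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(
                tokens[token_ids]
            )
        return out.reshape_as(x)


if __name__ == "__main__":
    layer = TopKMoE(d_model=64, d_hidden=256, n_experts=8, k=2)
    y = layer(torch.randn(2, 16, 64))
    print(y.shape)  # torch.Size([2, 16, 64])
```

Because only k of the n_experts feedforward blocks run per token, parameter count grows without a proportional increase in compute, which is why the works above focus on how expert weights are quantized, stored, allocated, and served rather than on the routing itself.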

Sources

MxMoE: Mixed-precision Quantization for MoE with Accuracy and Performance Co-Design

FloE: On-the-Fly MoE Inference

A Sensitivity-Driven Expert Allocation Method in LoRA-MoE for Efficient Fine-Tuning

QoS-Efficient Serving of Multiple Mixture-of-Expert LLMs Using Partial Runtime Reconfiguration

Attention Is Not All You Need: The Importance of Feedforward Networks in Transformer Models

The power of fine-grained experts: Granularity boosts expressivity in Mixture of Experts

UMoE: Unifying Attention and FFN with Shared Experts

Toward Cost-Efficient Serving of Mixture-of-Experts with Asynchrony

Efficient Mixed Precision Quantization in Graph Neural Networks

PT-MoE: An Efficient Finetuning Framework for Integrating Mixture-of-Experts into Prompt Tuning
