Efficient Large Language Models through Mixture-of-Experts Architectures

The field of large language models is moving towards more efficient and scalable architectures, with a strong focus on Mixture-of-Experts (MoE) models. By activating only a small subset of expert parameters for each input token, these models cut per-token computation while allowing total parameter counts to keep growing. Recent developments introduce novel MoE designs, such as adjugate experts, hierarchical token deduplication, and expert swap techniques, which accelerate training and improve performance. Researchers are also applying MoE to multimodal tasks and vision-language models, where the approach proves both effective and efficient. Noteworthy papers include EC2MoE, which proposes an adaptive end-cloud pipeline framework for scalable MoE inference, and MoIIE, which introduces a mixture of intra- and inter-modality experts for large vision-language models. These advances are expected to drive further innovation in the development of efficient and powerful large language models.
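
To make the sparse-activation idea concrete, below is a minimal, illustrative sketch of top-k expert routing in PyTorch. It is not the design of any paper listed in the sources; the class name SimpleMoELayer, the expert widths, and the routing details are assumptions chosen for clarity.

```python
# Minimal sketch of sparse MoE routing (illustrative only; hypothetical names,
# not the exact architecture of any paper cited in this digest).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleMoELayer(nn.Module):
    """Feed-forward MoE layer: each token is routed to its top-k experts,
    so only a small fraction of the layer's parameters is active per token."""

    def __init__(self, d_model: int, d_hidden: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model)
        logits = self.router(x)                          # (tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)   # keep only the top-k experts per token
        weights = F.softmax(weights, dim=-1)             # normalize over the selected experts

        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (idx == e).nonzero(as_tuple=True)  # tokens routed to expert e
            if token_ids.numel() == 0:
                continue  # this expert receives no tokens in this batch
            out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
        return out


if __name__ == "__main__":
    layer = SimpleMoELayer(d_model=64, d_hidden=256)
    tokens = torch.randn(10, 64)
    print(layer(tokens).shape)  # torch.Size([10, 64])
```

Because each token passes through only top_k of the n_experts feed-forward blocks, per-token compute stays roughly constant as more experts, and hence more parameters, are added; this is the scaling property the papers below build on.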

Sources

EC2MoE: Adaptive End-Cloud Pipeline Collaboration Enabling Scalable Mixture-of-Experts Inference

KnapFormer: An Online Load Balancer for Efficient Diffusion Transformers Training

GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

SSD Offloading for LLM Mixture-of-Experts Weights Considered Harmful in Energy Efficiency

Grove MoE: Towards Efficient and Superior MoE LLMs with Adjugate Experts

Progressive Depth Up-scaling via Optimal Transport

Motif 2.6B Technical Report

JustDense: Just using Dense instead of Sequence Mixer for Time Series analysis

CoMoE: Collaborative Optimization of Expert Aggregation and Offloading for MoE-based LLMs at Edge

Cluster Topology-Driven Placement of Experts Reduces Network Traffic in MoE Inference

HierMoE: Accelerating MoE Training with Hierarchical Token Deduplication and Expert Swap

$\mu$-Parametrization for Mixture of Experts

MoIIE: Mixture of Intra- and Inter-Modality Experts for Large Vision Language Models

Speed Always Wins: A Survey on Efficient Architectures for Large Language Models

Natively Trainable Sparse Attention for Hierarchical Point Cloud Datasets
