Research on large language models (LLMs) is increasingly focused on efficiency and scalability alongside raw performance. Mixture of Experts (MoE) architectures reduce computational cost through dynamic routing and sparse activation: each token is sent to only a small subset of experts, so only a fraction of the model's parameters is active per forward pass. Complementary techniques such as quantization, low-rank gradient projection, and adaptive allocation of rank and bitwidth further cut memory and compute requirements, lowering the cost of training and deploying LLMs in real-world settings.
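To make the routing-and-sparse-activation idea concrete, the following is a minimal sketch of a top-k MoE layer in PyTorch. The module structure, expert count, and the k=2 choice are illustrative assumptions, not any specific paper's design.

```python
# Minimal sketch of top-k MoE routing with sparse activation (illustrative only;
# module layout and hyperparameters are assumptions, not a specific paper's design).
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    def __init__(self, d_model: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)  # produces routing logits per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        logits = self.router(x)                           # (tokens, n_experts)
        weights, idx = torch.topk(logits.softmax(dim=-1), self.k, dim=-1)
        out = torch.zeros_like(x)
        # Only the k selected experts run per token; this sparsity is the source
        # of the compute savings relative to a dense FFN of comparable capacity.
        for e, expert in enumerate(self.experts):
            token_ids, slot = (idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
        return out

# Usage: route a batch of 16 token embeddings of width 512.
moe = TopKMoE(d_model=512)
y = moe(torch.randn(16, 512))
```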
Noteworthy papers in this area include Similarity-Aware MoE, which uses token similarity to guide routing and improve model robustness; MoEQuant, a quantization framework tailored to MoE LLMs that reports substantial performance gains; and ORXE, a modular framework for dynamically configurable efficiency that achieves strong performance and flexibility without complex metamodel training. Together, these approaches point toward more efficient and effective LLMs.
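As a rough illustration of the kind of low-level operation that quantization frameworks for LLMs build on, here is a generic symmetric per-channel INT8 weight quantization sketch. It is not MoEQuant's actual algorithm; the MoE-specific calibration that such a framework would add is not reproduced here.

```python
# Generic symmetric per-channel INT8 weight quantization/dequantization sketch.
# NOT MoEQuant's method; shown only to illustrate the basic quantize/dequantize step.
import torch

def quantize_int8(w: torch.Tensor):
    """Quantize a (out_features, in_features) weight matrix, one scale per output row."""
    scale = (w.abs().amax(dim=1, keepdim=True) / 127.0).clamp(min=1e-8)
    q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

# Usage: quantize a random weight matrix and check the reconstruction error.
w = torch.randn(256, 512)
q, s = quantize_int8(w)
err = (dequantize_int8(q, s) - w).abs().mean()
```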