Advances in Efficient Deployment of Large Language Models
The field of large language models (LLMs) is moving toward more efficient deployment on edge devices, with a focus on reducing memory footprint, computational cost, and communication overhead. Researchers are exploring techniques such as quantization, pruning, and knowledge distillation, along with newer ideas such as semantic multiplexing and dynamic expert quantization, to make LLMs practical in resource-constrained environments. Together, these advances could enable much broader adoption of LLMs in real-world applications. Noteworthy papers include SpecQuant, which achieves ultra-low-bit quantization for LLMs; OTARo, which lets on-device LLMs switch quantization precision on the fly while maintaining robust performance; and Nemotron Elastic, a framework for building reasoning-oriented LLMs that embed multiple nested submodels within a single parent model, allowing efficient deployment across different configurations and budgets.
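To make the quantization theme concrete, the sketch below shows per-channel symmetric weight quantization, the basic operation that low-bit methods like those surveyed here build on. The function names, bit-widths, and shapes are illustrative assumptions, not the API of SpecQuant, OTARo, or any other cited paper.

```python
# Minimal sketch of per-channel symmetric weight quantization (illustrative only;
# not any paper's actual implementation).
import numpy as np

def quantize_per_channel(weights: np.ndarray, bits: int = 4):
    """Quantize a (out_features, in_features) weight matrix to signed integers.

    Returns integer codes and one scale per output channel so that
    weights ~= codes * scales[:, None].
    """
    qmax = 2 ** (bits - 1) - 1                       # e.g. 7 for 4-bit signed
    scales = np.abs(weights).max(axis=1) / qmax      # one scale per row
    scales = np.where(scales == 0, 1e-8, scales)     # avoid division by zero
    codes = np.clip(np.round(weights / scales[:, None]), -qmax - 1, qmax)
    return codes.astype(np.int8), scales.astype(np.float32)

def dequantize(codes: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Reconstruct approximate float weights from codes and per-channel scales."""
    return codes.astype(np.float32) * scales[:, None]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(8, 16)).astype(np.float32)
    for bits in (8, 4, 2):
        codes, scales = quantize_per_channel(w, bits)
        err = np.abs(w - dequantize(codes, scales)).mean()
        print(f"{bits}-bit mean abs reconstruction error: {err:.4f}")
```

Running the script shows the reconstruction error shrinking as the bit-width grows, which is the accuracy-versus-footprint trade-off that ultra-low-bit and precision-switching approaches aim to manage.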
Sources
Why Should the Server Do It All?: A Scalable, Versatile, and Model-Agnostic Framework for Server-Light DNN Inference over Massively Distributed Clients via Training-Free Intermediate Feature Compression
Range Asymmetric Numeral Systems-Based Lightweight Intermediate Feature Compression for Split Computing of Deep Neural Networks
MCAQ-YOLO: Morphological Complexity-Aware Quantization for Efficient Object Detection with Curriculum Learning