Efficient Deployment of Large Language Models

The field of large language models (LLMs) is shifting toward efficient deployment and compression techniques that enable their use in resource-constrained environments. Researchers are reducing the memory and computational requirements of LLMs through quantization, pruning, and knowledge distillation, where the central challenge is preserving model performance while cutting resource demands. Recent work proposes frameworks such as outlier-aware weight-only quantization and entropy-encoded weight compression that make measurable progress on this trade-off. Notable papers include ICQuant, which leverages outlier statistics to design an efficient index coding scheme for outlier-aware weight-only quantization, and EntroLLM, which integrates mixed quantization with entropy coding to reduce storage overhead while maintaining model accuracy.
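To make the outlier-aware idea concrete, below is a minimal sketch of generic outlier-aware weight-only quantization: weights above a magnitude threshold are stored in full precision while the rest are quantized to a low-bit uniform grid. This is an illustrative assumption-laden example, not ICQuant's actual index coding scheme or EntroLLM's entropy coder; the function names, the percentile-based outlier threshold, and the symmetric uniform grid are all choices made for illustration.

```python
# Sketch only: generic outlier-aware weight-only quantization.
# Not the ICQuant algorithm; the threshold rule and storage format are assumptions.
import numpy as np

def quantize_outlier_aware(w: np.ndarray, bits: int = 3, outlier_pct: float = 1.0):
    """Split weights into outliers (kept in full precision) and inliers (low-bit uniform codes)."""
    flat = w.ravel()
    # Treat the top `outlier_pct` percent of weights by magnitude as outliers.
    thresh = np.percentile(np.abs(flat), 100.0 - outlier_pct)
    outlier_mask = np.abs(flat) > thresh

    inliers = flat[~outlier_mask]
    # Uniform quantization of the inlier range to 2^bits levels.
    levels = 2 ** bits
    scale = (inliers.max() - inliers.min()) / (levels - 1)
    zero_point = inliers.min()
    q = np.round((inliers - zero_point) / scale).astype(np.int32)

    return {
        "q_inliers": q,                                 # low-bit codes
        "scale": scale,
        "zero_point": zero_point,
        "outlier_values": flat[outlier_mask],           # kept in full precision
        "outlier_index": np.nonzero(outlier_mask)[0],   # positions (what an index coding scheme would compress)
        "shape": w.shape,
    }

def dequantize(packed):
    """Reconstruct an approximate weight tensor from the packed representation."""
    flat = np.empty(int(np.prod(packed["shape"])), dtype=np.float32)
    inlier_mask = np.ones(flat.size, dtype=bool)
    inlier_mask[packed["outlier_index"]] = False
    flat[inlier_mask] = packed["q_inliers"] * packed["scale"] + packed["zero_point"]
    flat[packed["outlier_index"]] = packed["outlier_values"]
    return flat.reshape(packed["shape"])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(256, 256)).astype(np.float32)
    packed = quantize_outlier_aware(w, bits=3, outlier_pct=1.0)
    print("mean abs error:", np.mean(np.abs(w - dequantize(packed))))
```

The design point the sketch illustrates is why outlier handling matters for low-bit quantization: a handful of large-magnitude weights would otherwise stretch the quantization range and waste most of the low-bit grid, so isolating them keeps the inlier scale tight at the cost of storing a small set of indices and values separately.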

Sources

ICQuant: Index Coding enables Low-bit LLM Quantization

Position: Enough of Scaling LLMs! Lets Focus on Downscaling

Low-Precision Training of Large Language Models: Methods, Challenges, and Opportunities

An Empirical Study of Qwen3 Quantization

Quantitative Analysis of Performance Drop in DeepSeek Model Quantization

Optimizing LLMs for Resource-Constrained Environments: A Survey of Model Compression Techniques

Opt-GPTQ: An Optimized GPTQ Combining Sparse Attention and Quantization Techniques

EntroLLM: Entropy Encoded Weight Compression for Efficient Large Language Model Inference on Edge Devices

RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale

Radio: Rate-Distortion Optimization for Large Language Model Compression

RWKVQuant: Quantizing the RWKV Family with Proxy Guided Hybrid of Scalar and Vector Quantization

Grouped Sequency-arranged Rotation: Optimizing Rotation Transformation for Quantization for Free
