Efficient Deployment of Large Language Models

The field of large language models (LLMs) is moving toward efficient deployment, with a focus on reducing model size and inference latency without compromising performance. Researchers are exploring several techniques to this end, including post-training pruning, quantization, and knowledge distillation. Notably, novel pruning methods that leverage weight update magnitudes and activation patterns have shown promising results (a sketch of the general idea follows this paragraph). At the same time, quantization has been found to have a nuanced impact on model bias, underscoring the need to weigh ethical implications alongside efficiency gains. Overall, the field is advancing toward more efficient and scalable LLMs suited to resource-constrained environments. Noteworthy papers include Z-Pruner, which introduces a post-training pruning method that requires no retraining, and How Quantization Shapes Bias in Large Language Models, which comprehensively evaluates how quantization affects model bias.
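The summary does not spell out Z-Pruner's exact scoring rule, so the following is only a minimal sketch of the general activation-aware pruning idea it alludes to: score each weight by its magnitude times the norm of the input activations it multiplies (in the spirit of Wanda-style |W| · ||x|| scores), then zero out the lowest-scoring weights. The function `activation_aware_prune`, its arguments, and the calibration step are hypothetical names for illustration, not taken from any listed paper.

```python
import torch

def activation_aware_prune(weight: torch.Tensor,
                           act_norms: torch.Tensor,
                           sparsity: float = 0.5) -> torch.Tensor:
    """Zero out the lowest-scoring weights of a linear layer, post-training.

    Each weight is scored by |w| * ||x||, where ||x|| is the L2 norm of the
    corresponding input activation channel (from a small calibration set),
    so weights that multiply high-magnitude activations are preserved.
    """
    # weight: (out_features, in_features); act_norms: (in_features,)
    scores = weight.abs() * act_norms.unsqueeze(0)
    # Within each output row, drop the `sparsity` fraction of lowest scores.
    k = int(weight.shape[1] * sparsity)
    _, prune_idx = torch.topk(scores, k, dim=1, largest=False)
    mask = torch.ones_like(weight, dtype=torch.bool)
    mask.scatter_(1, prune_idx, False)
    return weight * mask

# Usage: calibrate per-channel activation norms, then prune in place.
layer = torch.nn.Linear(1024, 4096)
calib = torch.randn(256, 1024)          # stand-in calibration batch
act_norms = calib.norm(p=2, dim=0)      # L2 norm per input channel
layer.weight.data = activation_aware_prune(layer.weight.data, act_norms)
```

Because the mask is computed from calibration statistics alone, this kind of method needs no gradient updates or retraining, which is what makes post-training pruning attractive for deployment.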

Sources

Z-Pruner: Post-Training Pruning of Large Language Models for Efficiency without Retraining

How Small is Enough? Empirical Evidence of Quantized Small Language Models for Automated Program Repair

ARSP: Automated Repair of Verilog Designs via Semantic Partitioning

WISCA: A Lightweight Model Transition Method to Improve LLM Training via Weight Scaling

Systematic Characterization of LLM Quantization: A Performance, Energy, and Quality Perspective

Interpreting the Effects of Quantization on LLMs

Less Is More? Examining Fairness in Pruned Large Language Models for Summarising Opinions

How Quantization Shapes Bias in Large Language Models

Scaling Laws for Task-Stratified Knowledge in Post-Training Quantized Large Language Models

LLM as an Execution Estimator: Recovering Missing Dependency for Practical Time-travelling Debugging

Quantized but Deceptive? A Multi-Dimensional Truthfulness Evaluation of Quantized LLMs

Quantization Robustness to Input Degradations for Object Detection

Spatio-Temporal Pruning for Compressed Spiking Large Language Models

Beacon: Post-Training Quantization with Integrated Grid Selection

The Uneven Impact of Post-Training Quantization in Machine Translation

ConfLogger: Enhance Systems' Configuration Diagnosability through Configuration Logging
