Efficient Model Optimization and Quantization

The field of model optimization and quantization is advancing rapidly, with a focus on improving the efficiency of large language models and vision transformers while preserving accuracy. Recent developments have centered on novel quantization strategies, such as sparse model inversion and block rotation, which aim to reduce the computational cost and memory footprint of these models. Researchers have also explored stochastic rounding, mixed-precision training, and outlier-aware post-training quantization to further improve performance under low-bit constraints. Noteworthy papers in this area include TetraJet-v2, which introduces an NVFP4 (4-bit) fully-quantized training method with oscillation suppression and outlier control, and DartQuant, which proposes an efficient rotational distribution calibration method for LLM quantization. Overall, these advances have the potential to significantly accelerate the deployment of large-scale models in resource-constrained environments.
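To make one of these ideas concrete, the sketch below illustrates stochastic rounding for low-bit quantization: each value is rounded up with probability equal to its fractional part, so the quantized tensor is unbiased in expectation, which helps keep gradient updates informative during low-precision training. This is a minimal, hypothetical example with a per-tensor symmetric scale; the function name and 4-bit scheme are our own assumptions, not the exact method of any paper listed below.

```python
import numpy as np

def stochastic_round_quantize(x, num_bits=4):
    """Quantize a float array to signed integers using stochastic rounding.

    Hypothetical illustration of the general technique, not a specific
    paper's recipe: per-tensor symmetric scale, signed num_bits range.
    """
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    max_abs = np.max(np.abs(x))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    scaled = x / scale
    floor = np.floor(scaled)
    # Round up with probability equal to the fractional part, so the
    # expected quantized value equals the real value (unbiased rounding).
    prob_up = scaled - floor
    rounded = floor + (np.random.random(x.shape) < prob_up)
    q = np.clip(rounded, qmin, qmax).astype(np.int8)
    return q, scale

# Usage: quantize, then dequantize with x_hat = q * scale.
x = np.random.randn(8).astype(np.float32)
q, scale = stochastic_round_quantize(x, num_bits=4)
x_hat = q.astype(np.float32) * scale
```

Deterministic round-to-nearest introduces a systematic bias that accumulates over many training steps; the stochastic variant trades a little per-step noise for zero bias, which is one reason it appears in low-bit training work such as the edge-LLM paper listed below.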

Sources

Sparse Model Inversion: Efficient Inversion of Vision Transformers for Data-Free Applications

TetraJet-v2: Accurate NVFP4 Training for Large Language Models with Oscillation Suppression and Outlier Control

Towards 1000-fold Electron Microscopy Image Compression for Connectomics via VQ-VAE with Transformer Prior

Outlier-Aware Post-Training Quantization for Image Super-Resolution

Training with Fewer Bits: Unlocking Edge LLMs Training with Stochastic Rounding

Efficiently Training A Flat Neural Network Before It has been Quantizated

EV-NVC: Efficient Variable bitrate Neural Video Compression

Fibbinary-Based Compression and Quantization for Efficient Neural Radio Receivers

FP8-Flow-MoE: A Casting-Free FP8 Recipe without Double Quantization Error

Using Span Queries to Optimize for Cache and Attention Locality

DartQuant: Efficient Rotational Distribution Calibration for LLM Quantization

Block Rotation is All You Need for MXFP4 Quantization
