Efficient Model Compression for Edge Devices

The field of model compression is moving toward approaches that prioritize reducing multiplications and memory accesses, yielding significant gains in energy efficiency and storage size. Recent work focuses on quantization-aware techniques, such as selectively applying different bit-widths to the components of depthwise-separable convolutional networks and integrating partial-sum quantization into the compression framework. These methods achieve substantial reductions in energy cost and storage size while maintaining comparable accuracy. Notable papers include PROM, which reduces energy cost by over an order of magnitude and storage size by 2.7x; APSQ, which achieves nearly lossless compression on NLP and CV tasks while cutting energy costs by 28-87%; and QStore, which enables lossless storage of models in multiple precisions, reducing the storage footprint by up to 2.2x.
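
As a rough illustration of the mixed bit-width idea, the sketch below (NumPy) quantizes the depthwise and pointwise weights of a separable block at different precisions. All names, and the 8-bit/4-bit split, are illustrative assumptions, not the actual schemes of PROM or APSQ.

```python
# Minimal sketch of mixed bit-width weight quantization for a
# depthwise-separable block. Names and bit-width choices are illustrative.
import numpy as np

def quantize_symmetric(w: np.ndarray, bits: int) -> np.ndarray:
    """Uniform symmetric quantization to signed `bits`-bit levels,
    returned in dequantized ("fake-quant") form."""
    qmax = 2 ** (bits - 1) - 1                        # e.g. 7 for 4-bit signed
    scale = max(float(np.abs(w).max()) / qmax, 1e-8)  # per-tensor scale
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

rng = np.random.default_rng(0)
depthwise = rng.standard_normal((32, 1, 3, 3)) * 0.1   # per-channel 3x3 filters
pointwise = rng.standard_normal((64, 32, 1, 1)) * 0.1  # 1x1 projection filters

# Hypothetical policy: spend fewer bits on the multiplication-heavy pointwise
# layer and more on the parameter-light depthwise layer (an assumption for
# illustration; the papers derive their bit-width assignments differently).
dw_q = quantize_symmetric(depthwise, bits=8)
pw_q = quantize_symmetric(pointwise, bits=4)

for name, w, wq in [("depthwise", depthwise, dw_q), ("pointwise", pointwise, pw_q)]:
    print(f"{name}: mean abs quantization error = {np.abs(w - wq).mean():.5f}")
```

Spending fewer bits on the pointwise filters targets the layer that dominates the multiplication count in depthwise-separable networks, which is the kind of cost-aware trade-off these papers formalize.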

Sources

PROM: Prioritize Reduction of Multiplications Over Lower Bit-Widths for Efficient CNNs

APSQ: Additive Partial Sum Quantization with Algorithm-Hardware Co-Design

QStore: Quantization-Aware Compressed Model Storage

EcoAgent: An Efficient Edge-Cloud Collaborative Multi-Agent Framework for Mobile Automation
