Efficient Large Language Model Inference and Compression

The field of large language models (LLMs) continues to move toward more efficient inference and compression. Recent work focuses on improving the accuracy and speed of LLMs while reducing their memory and compute requirements. Notable directions include adaptive compression combined with activation checkpointing, small-model-assisted compensation for KV cache compression, and hierarchical verification of speculative beams. Together, these innovations improve both the efficiency and the accuracy of LLM inference.
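
The speculative-beam work builds on speculative decoding, in which a small draft model proposes several tokens and the large target model verifies them. The sketch below shows only the basic greedy draft-and-verify loop with toy stand-in models; it is background for the general idea, not the hierarchical beam verification scheme from the paper, and every name in it is hypothetical.

```python
# Minimal sketch of the greedy draft-and-verify loop behind speculative
# decoding. Toy callables stand in for real models; illustrative only.
from typing import Callable, List

def speculative_decode(
    target_next: Callable[[List[int]], int],  # expensive model: next token id
    draft_next: Callable[[List[int]], int],   # cheap model: next token id
    prompt: List[int],
    num_draft: int = 4,
    max_new: int = 16,
) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # 1) Draft phase: the cheap model proposes a short continuation.
        ctx = list(out)
        draft = []
        for _ in range(num_draft):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) Verify phase: the target model keeps the longest draft prefix it
        #    agrees with. A real implementation scores all draft positions in
        #    a single batched forward pass; the loop here is only for clarity.
        ctx = list(out)
        accepted = []
        for t in draft:
            if target_next(ctx) != t:
                break
            accepted.append(t)
            ctx.append(t)
        out.extend(accepted)
        # 3) Always emit one token from the target model so progress is
        #    guaranteed even when the whole draft is rejected.
        out.append(target_next(out))
    return out

if __name__ == "__main__":
    # Toy "models": both predict the next value of a counting sequence,
    # so every draft token is accepted and decoding skips ahead quickly.
    target = lambda ctx: (ctx[-1] + 1) % 100
    draft = lambda ctx: (ctx[-1] + 1) % 100
    print(speculative_decode(target, draft, prompt=[0], max_new=10))
```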

Several papers in this area stand out. Adacc proposes a memory management framework that combines adaptive compression with activation checkpointing to reduce the GPU memory footprint. SmallKV designs two compensation mechanisms, built on the high similarity of attention matrices between LLMs of different scales, to address the saliency shift and marginal-information over-compression problems in KV cache compression. LieQ introduces a metric-driven post-training quantization framework that targets the challenge of maintaining accuracy in sub-7B models under extreme low-bit compression. FlexQ presents a post-training INT6 quantization framework that combines algorithmic innovation with system-level optimizations for LLM serving. Fairy±i proposes a new paradigm for 2-bit complex LLMs with all parameters in {±1, ±i}, leveraging the representational advantages of the complex domain to boost full-precision accuracy.
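
To make the last idea concrete, here is a minimal sketch of 2-bit complex quantization in the spirit of Fairy±i: consecutive real weights are paired into complex numbers, each is snapped to the nearest element of {±1, ±i}, and one real scale per row is fit by least squares. This is an assumption-laden toy, not the paper's actual algorithm; all function names are hypothetical.

```python
# Toy 2-bit complex quantizer: codebook {+1, -1, +i, -i}, per-row real scale.
# Illustrative sketch only; not the Fairy±i algorithm from the paper.
import numpy as np

ROOTS = np.array([1, -1, 1j, -1j], dtype=np.complex64)  # 2-bit codebook

def quantize_row(w: np.ndarray):
    """Quantize one row of real weights (even length) to codes in {±1, ±i}."""
    z = w[0::2] + 1j * w[1::2]                # pair reals into complex numbers
    # Nearest root of unity depends only on the phase of each z.
    codes = np.argmin(np.abs(z[:, None] - ROOTS[None, :]), axis=1)
    q = ROOTS[codes]
    # Least-squares scale: minimizes ||z - s*q||^2 over real s (|q| = 1).
    s = float(np.real(np.vdot(q, z)) / len(q))
    return codes.astype(np.uint8), s

def dequantize_row(codes: np.ndarray, s: float, n: int) -> np.ndarray:
    """Reconstruct the real weight row from 2-bit codes and the row scale."""
    zq = s * ROOTS[codes]
    w = np.empty(n, dtype=np.float32)
    w[0::2], w[1::2] = zq.real, zq.imag
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal(16).astype(np.float32)
    codes, s = quantize_row(w)
    w_hat = dequantize_row(codes, s, w.size)
    print("codes:", codes, "scale:", round(s, 3))
    print("reconstruction error:", float(np.linalg.norm(w - w_hat)))
```

Because the codebook holds four unit-magnitude entries, each complex weight (i.e., each pair of real weights) costs exactly 2 bits plus the shared per-row scale.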

Sources

Adacc: Adaptive Compression and Activation Checkpointing for LLM Memory Management

SmallKV: Small Model Assisted Compensation of KV Cache Compression for Efficient LLM Inference

Where and How to Enhance: Discovering Bit-Width Contribution for Mixed Precision Quantization

Exploring Layer-wise Information Effectiveness for Post-Training Quantization in Small Language Models

Hierarchical Verification of Speculative Beams for Accelerating LLM Inference

FlashCommunication V2: Bit Splitting and Spike Reserving for Any Bit Communication

FlexQ: Efficient Post-training INT6 Quantization for LLM Serving via Algorithm-System Co-Design

CARD: Cache-Assisted Parallel Speculative Decoding for Efficient Large Language Model Inference

Share Your Attention: Transformer Weight Sharing via Matrix-based Dictionary Learning

InfoQ: Mixed-Precision Quantization via Global Information Flow

Provable Post-Training Quantization: Theoretical Analysis of OPTQ and Qronos

MoBE: Mixture-of-Basis-Experts for Compressing MoE-based LLMs

Fairy±i: The First 2-bit Complex LLM with All Parameters in {±1, ±i}
