Efficient Large Language Model Inference and Compression

The field of large language models (LLMs) continues to move toward more efficient inference and compression. Recent work focuses on improving the accuracy and speed of LLMs while reducing their memory and compute requirements. Notable directions include adaptive compression combined with activation checkpointing, small-model-assisted compensation for KV cache compression, and hierarchical verification of speculative beams. Together, these innovations improve both the efficiency and the accuracy of LLM inference.
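
The speculative-beam work builds on speculative decoding, in which a small draft model proposes several tokens and the large target model verifies them. The sketch below shows only the basic greedy draft-and-verify loop with toy stand-in models; it is background for the general idea, not the hierarchical beam verification scheme from the paper, and every name in it is hypothetical.

```python
# Minimal sketch of the greedy draft-and-verify loop behind speculative
# decoding. Toy callables stand in for real models; illustrative only.
from typing import Callable, List

def speculative_decode(
    target_next: Callable[[List[int]], int],  # expensive model: next token id
    draft_next: Callable[[List[int]], int],   # cheap model: next token id
    prompt: List[int],
    num_draft: int = 4,
    max_new: int = 16,
) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # 1) Draft phase: the cheap model proposes a short continuation.
        ctx = list(out)
        draft = []
        for _ in range(num_draft):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) Verify phase: the target model keeps the longest draft prefix it
        #    agrees with. A real implementation scores all draft positions in
        #    a single batched forward pass; the loop here is only for clarity.
        ctx = list(out)
        accepted = []
        for t in draft:
            if target_next(ctx) != t:
                break
            accepted.append(t)
            ctx.append(t)
        out.extend(accepted)
        # 3) Always emit one token from the target model so progress is
        #    guaranteed even when the whole draft is rejected.
        out.append(target_next(out))
    return out

if __name__ == "__main__":
    # Toy "models": both predict the next value of a counting sequence,
    # so every draft token is accepted and decoding skips ahead quickly.
    target = lambda ctx: (ctx[-1] + 1) % 100
    draft = lambda ctx: (ctx[-1] + 1) % 100
    print(speculative_decode(target, draft, prompt=[0], max_new=10))
```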

Several papers in this area stand out. Adacc proposes a memory management framework that combines adaptive compression with activation checkpointing to reduce the GPU memory footprint. SmallKV designs two compensation mechanisms, built on the high similarity of attention matrices between LLMs of different scales, to address the saliency shift and marginal-information over-compression problems in KV cache compression. LieQ introduces a metric-driven post-training quantization framework that targets the challenge of maintaining accuracy in sub-7B models under extreme low-bit compression. FlexQ presents a post-training INT6 quantization framework that combines algorithmic innovation with system-level optimizations for LLM serving. Fairy±i proposes a new paradigm for 2-bit complex LLMs with all parameters in {±1, ±i}, leveraging the representational advantages of the complex domain to boost full-precision accuracy.
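
To make the last idea concrete, here is a minimal sketch of 2-bit complex quantization in the spirit of Fairy±i: consecutive real weights are paired into complex numbers, each is snapped to the nearest element of {±1, ±i}, and one real scale per row is fit by least squares. This is an assumption-laden toy, not the paper's actual algorithm; all function names are hypothetical.

```python
# Toy 2-bit complex quantizer: codebook {+1, -1, +i, -i}, per-row real scale.
# Illustrative sketch only; not the Fairy±i algorithm from the paper.
import numpy as np

ROOTS = np.array([1, -1, 1j, -1j], dtype=np.complex64)  # 2-bit codebook

def quantize_row(w: np.ndarray):
    """Quantize one row of real weights (even length) to codes in {±1, ±i}."""
    z = w[0::2] + 1j * w[1::2]                # pair reals into complex numbers
    # Nearest root of unity depends only on the phase of each z.
    codes = np.argmin(np.abs(z[:, None] - ROOTS[None, :]), axis=1)
    q = ROOTS[codes]
    # Least-squares scale: minimizes ||z - s*q||^2 over real s (|q| = 1).
    s = float(np.real(np.vdot(q, z)) / len(q))
    return codes.astype(np.uint8), s

def dequantize_row(codes: np.ndarray, s: float, n: int) -> np.ndarray:
    """Reconstruct the real weight row from 2-bit codes and the row scale."""
    zq = s * ROOTS[codes]
    w = np.empty(n, dtype=np.float32)
    w[0::2], w[1::2] = zq.real, zq.imag
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal(16).astype(np.float32)
    codes, s = quantize_row(w)
    w_hat = dequantize_row(codes, s, w.size)
    print("codes:", codes, "scale:", round(s, 3))
    print("reconstruction error:", float(np.linalg.norm(w - w_hat)))
```

Because the codebook holds four unit-magnitude entries, each complex weight (i.e., each pair of real weights) costs exactly 2 bits plus the shared per-row scale.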

Sources

Adacc: Adaptive Compression and Activation Checkpointing for LLM Memory Management

SmallKV: Small Model Assisted Compensation of KV Cache Compression for Efficient LLM Inference

Where and How to Enhance: Discovering Bit-Width Contribution for Mixed Precision Quantization

Exploring Layer-wise Information Effectiveness for Post-Training Quantization in Small Language Models

Hierarchical Verification of Speculative Beams for Accelerating LLM Inference

FlashCommunication V2: Bit Splitting and Spike Reserving for Any Bit Communication

FlexQ: Efficient Post-training INT6 Quantization for LLM Serving via Algorithm-System Co-Design

CARD: Cache-Assisted Parallel Speculative Decoding for Efficient Large Language Model Inference

Share Your Attention: Transformer Weight Sharing via Matrix-based Dictionary Learning

InfoQ: Mixed-Precision Quantization via Global Information Flow

Provable Post-Training Quantization: Theoretical Analysis of OPTQ and Qronos

MoBE: Mixture-of-Basis-Experts for Compressing MoE-based LLMs

Fairy±i: The First 2-bit Complex LLM with All Parameters in {±1, ±i}
