Advances in Efficient Large Language Model Compression and Fine-Tuning

The field of large language models (LLMs) is moving toward more efficient compression and fine-tuning methods that reduce compute and memory requirements. Recent work centers on quantization techniques such as grouped lattice vector quantization and mixed-precision quantization, which achieve better trade-offs between model size and accuracy, as well as new fine-tuning methods such as token-wise input-output projections and zero-latency fused low-rank adapters, which reduce inference latency while maintaining performance. Together, these advances make it more practical to deploy large models under stringent resource constraints. Noteworthy papers include Learning Grouped Lattice Vector Quantizers for Low-Bit LLM Compression, which introduces a novel quantization framework for low-bit compression; zFLoRA: Zero-Latency Fused Low-Rank Adapters, which proposes an adapter design with zero or negligible latency overhead; and LoRAQuant: Mixed-Precision Quantization of LoRA to Ultra-Low Bits, a mixed-precision post-training quantization method tailored to LoRA. A minimal sketch of the underlying ideas, low-rank adapters that can be fused into the base weights and group-wise weight quantization, is shown below.
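The sketch below is illustrative only and does not reproduce any of the cited papers' methods. It assumes PyTorch and uses hypothetical names (`LoRALinear`, `quantize_groupwise`); the quantizer is plain symmetric round-to-nearest per group, not a lattice or mixed-precision scheme, and the `fuse` step simply folds the low-rank update into the base weight to show why a fused adapter adds no inference latency.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Illustrative LoRA layer: frozen base weight W plus trainable update (alpha/r) * B @ A."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # base weights stay frozen during fine-tuning
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: adapter starts as a no-op
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Separate low-rank path adds a small extra matmul cost at inference time.
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

    def fuse(self) -> nn.Linear:
        """Fold the low-rank update into the base weight, removing adapter overhead at inference."""
        fused = nn.Linear(self.base.in_features, self.base.out_features,
                          bias=self.base.bias is not None)
        with torch.no_grad():
            fused.weight.copy_(self.base.weight + self.scaling * (self.B @ self.A))
            if self.base.bias is not None:
                fused.bias.copy_(self.base.bias)
        return fused


def quantize_groupwise(w: torch.Tensor, bits: int = 4, group_size: int = 64) -> torch.Tensor:
    """Symmetric per-group round-to-nearest quantization (a simple stand-in, not lattice-based)."""
    flat = w.reshape(-1, group_size)                     # assumes numel divisible by group_size
    qmax = 2 ** (bits - 1) - 1
    scale = flat.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(flat / scale), -qmax - 1, qmax)
    return (q * scale).reshape(w.shape)                  # dequantized weights for accuracy checks


if __name__ == "__main__":
    layer = LoRALinear(nn.Linear(512, 512), r=8)
    x = torch.randn(2, 512)
    fused = layer.fuse()
    print(torch.allclose(layer(x), fused(x), atol=1e-5))  # fused path matches adapter path

    w4 = quantize_groupwise(fused.weight.data, bits=4, group_size=64)
    print((w4 - fused.weight.data).abs().mean())          # mean 4-bit quantization error
```

The two prints give a rough feel for the trade-offs discussed above: fusing eliminates the adapter's extra matmul without changing outputs, while lowering the bit width or enlarging the quantization groups shrinks storage at the cost of a larger reconstruction error.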

Sources

Learning Grouped Lattice Vector Quantizers for Low-Bit LLM Compression

$\alpha$-LoRA: Effective Fine-Tuning via Base Model Rescaling

Ageing Drift in Binary Face Templates: A Bits-per-Decade Analysis

KARIPAP: Quantum-Inspired Tensor Network Compression of Large Language Models Using Infinite Projected Entangled Pair States and Tensor Renormalization Group

Switchable Token-Specific Codebook Quantization For Face Image Compression

Beyond Higher Rank: Token-wise Input-Output Projections for Efficient Low-Rank Adaptation

CNOT Minimal Circuit Synthesis: A Reinforcement Learning Approach

Block-Diagonal LoRA for Eliminating Communication Overhead in Tensor Parallel LoRA Serving

Hybrid Quantum-Classical Recurrent Neural Networks

zFLoRA: Zero-Latency Fused Low-Rank Adapters

On the Impact of Weight Discretization in QUBO-Based SVM Training

1+1>2: A Synergistic Sparse and Low-Rank Compression Method for Large Language Models

LoRAQuant: Mixed-Precision Quantization of LoRA to Ultra-Low Bits

STaMP: Sequence Transformation and Mixed Precision for Low-Precision Activation Quantization
