Advances in Efficient Deployment of Large Language Models

Research on large language models (LLMs) is increasingly focused on efficient deployment, aiming to reduce memory footprint and computational cost. Recent work applies techniques such as quantization, pruning, and knowledge distillation to run LLMs on resource-constrained devices. Notable papers in this area include AnyBCQ, which presents a hardware-friendly multi-precision extension of Binary-Coded Quantization, and ADiP, which proposes an adaptive-precision systolic array architecture for accelerating matrix multiplication. Also noteworthy are Bhasha-Rupantarika, a lightweight algorithm-hardware co-designed multilingual translation system, and XQuant, which achieves ultra-low-bit KV cache quantization through cross-layer compression. Together, these approaches offer practical paths to deploying LLMs efficiently across a broad range of applications.
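For readers unfamiliar with the core idea behind many of these methods, the sketch below illustrates symmetric per-group weight quantization, the basic building block that schemes like BCQ and KV cache quantization refine. It is a minimal, generic illustration, not the specific algorithm of any paper listed here; the 4-bit width and group size of 64 are assumptions chosen for clarity.

```python
import numpy as np

def quantize_groupwise(w, bits=4, group_size=64):
    """Symmetric per-group quantization: each group of weights shares one scale."""
    qmax = 2 ** (bits - 1) - 1
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero for all-zero groups
    q = np.clip(np.round(groups / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize_groupwise(q, scale, shape):
    """Recover an approximate float tensor from integer codes and per-group scales."""
    return (q.astype(np.float32) * scale).reshape(shape)

# Usage: quantize a toy weight matrix and measure reconstruction error.
w = np.random.randn(128, 256).astype(np.float32)
q, s = quantize_groupwise(w.ravel(), bits=4, group_size=64)
w_hat = dequantize_groupwise(q, s, w.shape)
print("mean abs error:", np.abs(w - w_hat).mean())
```

Smaller groups and higher bit widths reduce reconstruction error at the cost of storing more scales; the papers below explore hardware-aware variants of exactly this trade-off.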

Sources

AnyBCQ: Hardware Efficient Flexible Binary-Coded Quantization for Multi-Precision LLMs

ADiP: Adaptive Precision Systolic Array for Matrix Multiplication Acceleration

Bhasha-Rupantarika: Algorithm-Hardware Co-design approach for Multilingual Neural Machine Translation

Not All Bits Are Equal: Scale-Dependent Memory Optimization Strategies for Reasoning Models

Efficient Edge Test-Time Adaptation via Latent Feature Coordinate Correction

ELMO: Efficiency via Low-precision and Peak Memory Optimization in Large Output Spaces

Efficient In-Memory Acceleration of Sparse Block Diagonal LLMs

XQuant: Achieving Ultra-Low Bit KV Cache Quantization with Cross-Layer Compression

Rescaling-Aware Training for Efficient Deployment of Deep Learning Models on Full-Integer Hardware

High-Parallel FPGA-Based Discrete Simulated Bifurcation for Large-Scale Optimization

SMEC: Rethinking Matryoshka Representation Learning for Retrieval Embedding Compression

Structured Sparsity and Weight-adaptive Pruning for Memory and Compute efficient Whisper models

CARVQ: Corrective Adaptor with Group Residual Vector Quantization for LLM Embedding Compression

Real-Time Crowd Counting for Embedded Systems with Lightweight Architecture

Group-Wise Optimization for Self-Extensible Codebooks in Vector Quantized Models

Energy-Efficient FPGA Framework for Non-Quantized Convolutional Neural Networks

F-BFQ: Flexible Block Floating-Point Quantization Accelerator for LLMs

Accelerated Feature Detectors for Visual SLAM: A Comparative Study of FPGA vs GPU

ShishuLM: Lightweight Language Model with Hybrid Decoder-MLP Architecture and Paired Weight Sharing

BitNet Distillation

DIAMOND: Systolic Array Acceleration of Sparse Matrix Multiplication for Quantum Simulation

Computing-In-Memory Aware Model Adaption For Edge Devices

Low Power Vision Transformer Accelerator with Hardware-Aware Pruning and Optimized Dataflow

MX+: Pushing the Limits of Microscaling Formats for Efficient Large Language Model Serving

FraQAT: Quantization Aware Training with Fractional bits
