Efficient Large Language Model Inference on Edge Devices

The field of large language models (LLMs) is moving toward efficient inference on edge devices, driven by the need for real-time processing, stronger privacy, and lower latency. Recent work focuses on adapting LLMs to resource-constrained hardware such as wearables, embedded systems, and edge FPGAs, using techniques including quantization, KV-cache management, and activation compression to cut computational and memory demands. Noteworthy papers in this area include TeLLMe, a table-lookup-based ternary LLM accelerator for low-power edge FPGAs; Kelle, a software-hardware co-design that pairs KV caching with eDRAM for LLM serving at the edge; AMS-Quant, which introduces adaptive mantissa sharing for floating-point quantization; and FourierCompress, a layer-aware spectral activation compression framework for collaborative LLM inference. Together, these advances aim to improve the efficiency of LLM inference on edge devices without sacrificing accuracy.
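
To make the quantization idea concrete, below is a minimal, self-contained Python sketch of ternary weight quantization: each weight is mapped to {-1, 0, +1} with a single per-tensor scale, so the matrix multiply reduces to additions and subtractions. The threshold heuristic, scale choice, and function names are illustrative assumptions and do not reproduce the specific schemes proposed in TeLLMe, AMS-Quant, or ELUTQ.

```python
import numpy as np

def ternary_quantize(w: np.ndarray, threshold_ratio: float = 0.7):
    """Quantize a float weight tensor to {-1, 0, +1} with a per-tensor scale.

    Generic illustration of ternary quantization; the threshold heuristic
    (threshold_ratio * mean|w|) and per-tensor scaling are assumptions,
    not the scheme used by any specific paper cited above.
    """
    delta = threshold_ratio * np.abs(w).mean()          # sparsity threshold
    q = np.zeros_like(w, dtype=np.int8)
    q[w > delta] = 1
    q[w < -delta] = -1
    # Scale: mean magnitude of the weights kept non-zero.
    nonzero = q != 0
    alpha = float(np.abs(w[nonzero]).mean()) if nonzero.any() else 0.0
    return q, alpha

def ternary_matmul(x: np.ndarray, q: np.ndarray, alpha: float) -> np.ndarray:
    """Matmul against ternary weights: additions/subtractions, scaled once."""
    return alpha * (x @ q.astype(x.dtype))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(256, 256)).astype(np.float32)
    x = rng.normal(size=(1, 256)).astype(np.float32)
    q, alpha = ternary_quantize(w)
    err = np.abs(x @ w - ternary_matmul(x, q, alpha)).mean()
    print(f"mean abs error of ternary approximation: {err:.4f}")
```

Because each quantized weight needs only about two bits, this kind of representation sharply reduces weight storage and memory traffic, which is typically the dominant cost of LLM decoding on edge hardware.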

Sources

TeLLMe v2: An Efficient End-to-End Ternary LLM Prefill and Decode Accelerator with Table-Lookup Matmul on Edge FPGAs

Kelle: Co-design KV Caching and eDRAM for Efficient LLM Serving in Edge Computing

AMS-QUANT: Adaptive Mantissa Sharing for Floating-point Quantization

FourierCompress: Layer-Aware Spectral Activation Compression for Efficient and Accurate Collaborative LLM Inference

Mixed-Precision Quantization for Language Models: Techniques and Prospects

FLASH Viterbi: Fast and Adaptive Viterbi Decoding for Modern Data Systems

ELUTQ: Efficient LUT-Aware Quantization for Deploying Large Language Models on Edge Devices
