Efficient AI Model Deployment on Edge Devices

The field of artificial intelligence is moving towards optimizing model deployment on resource-constrained edge devices. Researchers are exploring mechanisms that reduce model size and improve generalization, enabling AI frameworks to run on devices with limited hardware and software support. One notable direction is training-free frameworks that exploit temporal sparsity in attention patterns, allowing efficient model inference on edge devices. Another area of focus is combining knowledge distillation with early-exit mechanisms to achieve state-of-the-art accuracy under edge constraints. There is also growing interest in parameter-efficient fine-tuning methods that reduce computational cost and minimize the number of additional parameters needed to adapt models to downstream tasks. Minimal code sketches of these three techniques appear after the highlights below.

Noteworthy papers in this area include:

Knowledge Grafting introduces a mechanism for optimizing AI models for resource-constrained environments, achieving an 88.54% reduction in model size while improving generalization capability.

DeltaLLM presents a training-free framework that exploits temporal sparsity in attention patterns to enable efficient LLM inference on edge devices, achieving up to 60% sparsity during the prefilling stage with negligible accuracy drop.

LoRA-PAR proposes a dual-system LoRA framework that partitions both data and parameters by System 1 or System 2 demands, using fewer but more focused parameters for each task, and achieves state-of-the-art results with lower active parameter usage.

RRTO introduces a high-performance transparent offloading system for model inference in mobile edge computing, achieving substantial reductions in per-inference latency and energy consumption compared to state-of-the-art transparent methods.
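
To make the temporal-sparsity idea concrete, the sketch below shows one simplified way an inference step might skip attention recomputation when inputs change little between steps: query rows whose relative delta from the previous step falls below a threshold reuse cached outputs. This is a minimal illustration under assumed tensor shapes and a hypothetical `threshold` parameter, not DeltaLLM's actual algorithm.

```python
import torch

def delta_sparse_attention(q, k, v, prev_q, prev_out, threshold=1e-2):
    # q, prev_q: (T, d) queries at the current and previous step;
    # k, v: (T, d) keys/values; prev_out: (T, d) cached attention outputs.
    # Hypothetical simplification for illustration, not DeltaLLM's method.
    delta = (q - prev_q).norm(dim=-1) / (prev_q.norm(dim=-1) + 1e-8)
    stale = delta > threshold              # rows that must be recomputed
    out = prev_out.clone()                 # start from cached outputs
    if stale.any():
        scores = q[stale] @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
        out[stale] = torch.softmax(scores, dim=-1) @ v
    return out, 1.0 - stale.float().mean().item()  # outputs, sparsity ratio
```

A real system would track deltas across decoding steps and combine this with the usual KV cache; the ratio returned here corresponds loosely to the fraction of attention rows skipped.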
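
The distillation-plus-early-exit line of work attaches lightweight classifier heads to intermediate layers so that easy inputs exit before the full network runs. Below is a minimal, hypothetical PyTorch sketch of the early-exit half of that idea (the distillation training loop is omitted); the layer sizes and `confidence` threshold are illustrative, not taken from the cited paper.

```python
import torch
import torch.nn as nn

class EarlyExitNet(nn.Module):
    """Backbone with an intermediate exit head; inference stops early
    when the head's confidence clears a threshold (one sample at a time)."""
    def __init__(self, in_dim=128, hidden=64, num_classes=10):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.exit1 = nn.Linear(hidden, num_classes)   # early-exit head
        self.stage2 = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU())
        self.exit2 = nn.Linear(hidden, num_classes)   # final head

    def forward(self, x, confidence=0.9):
        h = self.stage1(x)
        p1 = torch.softmax(self.exit1(h), dim=-1)
        if p1.max() >= confidence:        # confident enough: skip stage2
            return p1, "early"
        return torch.softmax(self.exit2(self.stage2(h)), dim=-1), "full"
```

In a distillation setup, both heads would be trained against a larger teacher's soft labels; at deployment, the confidence threshold trades accuracy for latency and energy.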
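
Parameter-efficient fine-tuning in the LoRA family freezes the pretrained weights and learns only a low-rank update, W x + (alpha / r) * B A x. The sketch below shows the standard LoRA parameterization as a reference point; LoRA-PAR's System-1/System-2 partitioning of data and parameters is not reproduced here.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False           # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init
        self.scale = alpha / r                # standard LoRA scaling

    def forward(self, x):
        # Update starts at zero (B is zero-initialized), so behavior
        # matches the pretrained layer before any fine-tuning.
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T
```

Wrapping a projection as, e.g., `LoRALinear(nn.Linear(512, 512), r=8)` leaves the base weights untouched while training only r * (in + out) adapter parameters per layer.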

Sources

Knowledge Grafting: A Mechanism for Optimizing AI Model Deployment in Resource-Constrained Environments

DeltaLLM: A Training-Free Framework Exploiting Temporal Sparsity for Efficient Edge LLM Inference

Lightweight Remote Sensing Scene Classification on Edge Devices via Knowledge Distillation and Early-exit

LoRA-PAR: A Flexible Dual-System LoRA Partitioning Approach to Efficient LLM Fine-Tuning

RRTO: A High-Performance Transparent Offloading System for Model Inference in Mobile Edge Computing

On the Sustainability of AI Inferences in the Edge

Impact of Hyperparameter Optimization on the Accuracy of Lightweight Deep Learning Models for Real-Time Image Classification

From LLMs to Edge: Parameter-Efficient Fine-Tuning on Edge Devices
