Edge-Assisted Large Language Model Inference

Large language model (LLM) inference is shifting toward edge-assisted approaches that pair consumer-grade GPUs at the edge with server GPUs to improve cost efficiency and reduce latency. The trend is driven by the need to deploy LLMs in resource-constrained environments such as edge devices while preserving performance and accuracy, and researchers are exploring techniques including speculative decoding, early exits, and hetero-core parallelism to accelerate inference on edge hardware.

Noteworthy papers in this area include SpecEdge, a scalable edge-assisted serving framework that splits LLM workloads between edge and server GPUs, and Ghidorah, an LLM inference system that combines speculative decoding with hetero-core parallelism for fast inference on end-user devices. In addition, Clip4Retrofit enables real-time image labeling on edge devices via cross-architecture CLIP distillation, and SCORPIO introduces an SLO-oriented LLM serving system designed to maximize system goodput and SLO attainment for workloads with heterogeneous SLOs.
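
To make the edge/server split concrete, the sketch below shows the basic speculative decoding loop in which a cheap edge-side draft model proposes a short block of tokens and a server-side target model verifies the whole block in one pass, keeping the longest agreeing prefix. The toy models, the greedy exact-match acceptance rule, and all function names here are illustrative assumptions for exposition only, not the implementation of SpecEdge, Ghidorah, or any other cited system.

```python
# Minimal sketch of edge-assisted speculative decoding.
# Assumptions (not from the cited papers): toy character-level "models",
# a greedy exact-match acceptance rule, and fixed-length draft blocks.
import random

VOCAB = list("abcdefgh")

def draft_next(context):
    """Cheap edge-side draft model: biased toward repeating the last token."""
    if context and random.random() < 0.7:
        return context[-1]
    return random.choice(VOCAB)

def target_next(context):
    """Expensive server-side target model: deterministic given the context."""
    rng = random.Random(hash(tuple(context)))
    return rng.choice(VOCAB)

def speculative_decode(prompt, max_new_tokens=16, draft_len=4):
    """Edge drafts `draft_len` tokens per round; server verifies them in one pass."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. Edge: autoregressively draft a short block of candidate tokens.
        draft, ctx = [], list(tokens)
        for _ in range(draft_len):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2. Server: verify the block, keeping the longest agreeing prefix.
        accepted, ctx = [], list(tokens)
        for t in draft:
            if target_next(ctx) == t:
                accepted.append(t)
                ctx.append(t)
            else:
                break
        # 3. Server always emits one corrected token, so progress is guaranteed
        #    and the output matches what the target model alone would produce.
        correction = target_next(tokens + accepted)
        tokens.extend(accepted + [correction])
    return "".join(tokens)

if __name__ == "__main__":
    print(speculative_decode(list("ab")))
```

The payoff of this pattern is that each edge-to-server round trip verifies several tokens at once, so the expensive server model is invoked roughly once per accepted block rather than once per token; production systems replace the exact-match rule above with probabilistic rejection sampling to keep the output distribution identical to the target model's.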

Sources

SpecEdge: Scalable Edge-Assisted Serving Framework for Interactive LLMs

Semi-Clairvoyant Scheduling of Speculative Decoding Requests to Minimize LLM Inference Latency

Clip4Retrofit: Enabling Real-Time Image Labeling on Edge Devices via Cross-Architecture CLIP Distillation

Fast and Cost-effective Speculative Edge-Cloud Decoding with Early Exits

Real-World Modeling of Computation Offloading for Neural Networks with Early Exits and Splits

SCORPIO: Serving the Right Requests at the Right Time for Heterogeneous SLOs in LLM Inference

Ghidorah: Fast LLM Inference on Edge with Speculative Decoding and Hetero-Core Parallelism
