Edge-Assisted Large Language Model Inference

Large language model (LLM) inference is shifting toward edge-assisted approaches that pair consumer-grade GPUs at the edge with server GPUs to improve cost efficiency and reduce latency. The trend is driven by the need to deploy LLMs in resource-constrained environments such as edge devices while preserving performance and accuracy, and researchers are exploring techniques including speculative decoding, early exits, and hetero-core parallelism to accelerate inference on edge hardware.

Noteworthy papers in this area include SpecEdge, a scalable edge-assisted serving framework that splits LLM workloads between edge and server GPUs, and Ghidorah, an LLM inference system that combines speculative decoding with hetero-core parallelism for fast inference on end-user devices. In addition, Clip4Retrofit enables real-time image labeling on edge devices via cross-architecture CLIP distillation, and SCORPIO introduces an SLO-oriented LLM serving system designed to maximize system goodput and SLO attainment for workloads with heterogeneous SLOs.
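
To make the edge/server split concrete, the sketch below shows the basic speculative decoding loop in which a cheap edge-side draft model proposes a short block of tokens and a server-side target model verifies the whole block in one pass, keeping the longest agreeing prefix. The toy models, the greedy exact-match acceptance rule, and all function names here are illustrative assumptions for exposition only, not the implementation of SpecEdge, Ghidorah, or any other cited system.

```python
# Minimal sketch of edge-assisted speculative decoding.
# Assumptions (not from the cited papers): toy character-level "models",
# a greedy exact-match acceptance rule, and fixed-length draft blocks.
import random

VOCAB = list("abcdefgh")

def draft_next(context):
    """Cheap edge-side draft model: biased toward repeating the last token."""
    if context and random.random() < 0.7:
        return context[-1]
    return random.choice(VOCAB)

def target_next(context):
    """Expensive server-side target model: deterministic given the context."""
    rng = random.Random(hash(tuple(context)))
    return rng.choice(VOCAB)

def speculative_decode(prompt, max_new_tokens=16, draft_len=4):
    """Edge drafts `draft_len` tokens per round; server verifies them in one pass."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. Edge: autoregressively draft a short block of candidate tokens.
        draft, ctx = [], list(tokens)
        for _ in range(draft_len):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2. Server: verify the block, keeping the longest agreeing prefix.
        accepted, ctx = [], list(tokens)
        for t in draft:
            if target_next(ctx) == t:
                accepted.append(t)
                ctx.append(t)
            else:
                break
        # 3. Server always emits one corrected token, so progress is guaranteed
        #    and the output matches what the target model alone would produce.
        correction = target_next(tokens + accepted)
        tokens.extend(accepted + [correction])
    return "".join(tokens)

if __name__ == "__main__":
    print(speculative_decode(list("ab")))
```

The payoff of this pattern is that each edge-to-server round trip verifies several tokens at once, so the expensive server model is invoked roughly once per accepted block rather than once per token; production systems replace the exact-match rule above with probabilistic rejection sampling to keep the output distribution identical to the target model's.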

Sources

SpecEdge: Scalable Edge-Assisted Serving Framework for Interactive LLMs

Semi-Clairvoyant Scheduling of Speculative Decoding Requests to Minimize LLM Inference Latency

Clip4Retrofit: Enabling Real-Time Image Labeling on Edge Devices via Cross-Architecture CLIP Distillation

Fast and Cost-effective Speculative Edge-Cloud Decoding with Early Exits

Real-World Modeling of Computation Offloading for Neural Networks with Early Exits and Splits

SCORPIO: Serving the Right Requests at the Right Time for Heterogeneous SLOs in LLM Inference

Ghidorah: Fast LLM Inference on Edge with Speculative Decoding and Hetero-Core Parallelism
