Advancements in Edge AI and Efficient LLM Inference

The field of Edge AI and Large Language Model (LLM) inference is evolving rapidly, driven by the need for efficient, low-latency, and private processing of complex AI workloads. Recent work has focused on improving the performance and scalability of edge devices and on reducing the computational requirements of LLMs.

Notable directions include heterogeneous AI extensions, dynamic activation-aware weight pruning, and bandwidth management to raise the performance of multi-core CPUs, alongside serverless data planes and federated pruning frameworks that improve the efficiency and scalability of LLM inference. Research has also explored collaborative inference systems that split work between edge and cloud resources, aiming to balance latency, cost, and privacy.

Among the highlighted papers, EdgeMM presents a multi-core CPU with heterogeneous AI extensions and activation-aware weight pruning, reporting a 2.84x speedup over laptop GPUs for multimodal LLMs at the edge. Palladium introduces a DPU-enabled serverless data plane that reduces CPU burden and enables efficient zero-copy communication, reporting a 20.9x improvement in requests per second (RPS) and a 21x reduction in latency.
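To make the activation-aware pruning idea concrete, the sketch below scores each weight by its magnitude scaled by a per-input-channel activation norm and keeps only the highest-scoring weights in each row. This is a minimal, generic illustration of the technique (in the spirit of Wanda-style pruning), not the specific method used by EdgeMM or the federated pruning work; the function name, calibration statistics, and tensor shapes are assumptions made for the example.

```python
import torch

def activation_aware_prune(weight: torch.Tensor,
                           act_norms: torch.Tensor,
                           sparsity: float = 0.5) -> torch.Tensor:
    """Zero out weights with the lowest |weight| * activation-norm scores.

    weight:    (out_features, in_features) linear-layer weight
    act_norms: (in_features,) per-input-channel activation norms, assumed to
               be collected from a small calibration set
    sparsity:  fraction of weights to remove per output row
    """
    # Score each weight by its magnitude, scaled by how strongly the
    # corresponding input channel is activated on calibration data.
    scores = weight.abs() * act_norms.unsqueeze(0)

    # Keep the top-(1 - sparsity) fraction of weights in each row.
    k = int(weight.shape[1] * (1.0 - sparsity))
    topk_idx = scores.topk(k, dim=1).indices
    mask = torch.zeros_like(weight, dtype=torch.bool)
    mask.scatter_(1, topk_idx, True)

    return weight * mask


# Example: prune a single 4096x4096 linear layer to 50% sparsity.
w = torch.randn(4096, 4096)
norms = torch.rand(4096)  # stand-in for real calibration statistics
pruned = activation_aware_prune(w, norms, sparsity=0.5)
print(f"sparsity: {(pruned == 0).float().mean().item():.2f}")
```

In practice the activation norms would come from running a calibration batch through the model and recording per-channel input statistics; the scaling makes pruning favor weights attached to rarely activated channels over large weights on heavily used ones.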

Sources

EdgeMM: Multi-Core CPU with Heterogeneous AI-Extension and Activation-aware Weight Pruning for Multimodal LLMs at Edge

Palladium: A DPU-enabled Multi-Tenant Serverless Cloud over Zero-copy Multi-node RDMA Fabrics

SCAREY: Location-Aware Service Lifecycle Management

EdgeWisePersona: A Dataset for On-Device User Profiling from Natural Language Interactions

An Edge AI Solution for Space Object Detection

Exploring Federated Pruning for Large Language Models

Through a Compressed Lens: Investigating the Impact of Quantization on LLM Explainability and Interpretability

Prime Collective Communications Library -- Technical Report

CE-LSLM: Efficient Large-Small Language Model Inference and Communication via Cloud-Edge Collaboration

SkyMemory: A LEO Edge Cache for Transformer Inference Optimization and Scale Out

ServerlessLoRA: Minimizing Latency and Cost in Serverless Inference for LoRA-Based LLMs

Design and Evaluation of a Microservices Cloud Framework for Online Travel Platforms

Harnessing Large Language Models Locally: Empirical Results and Implications for AI PC

An Efficient Private GPT Never Autoregressively Decodes

A Federated Splitting Framework for LLMs: Security, Efficiency, and Adaptability

Small Language Models in the Real World: Insights from Industrial Text Classification

Explain Less, Understand More: Jargon Detection via Personalized Parameter-Efficient Fine-tuning

Smaller, Smarter, Closer: The Edge of Collaborative Generative AI

Performance of Confidential Computing GPUs

Recursive Offloading for LLM Serving in Multi-tier Networks

Edge-First Language Model Inference: Models, Metrics, and Tradeoffs
