Advancements in Edge AI and Efficient LLM Inference

The field of Edge AI and Large Language Model (LLM) inference is evolving rapidly, driven by the need for efficient, low-latency, and private processing of complex AI workloads. Recent work has focused on improving the performance and scalability of edge devices and on reducing the computational cost of LLMs. Notable advances include heterogeneous AI extensions, dynamic activation-aware weight pruning, and bandwidth management for multi-core CPUs, while serverless data planes and federated pruning frameworks have been proposed to improve the efficiency and scalability of LLM inference. Research has also explored collaborative inference systems that split work between edge and cloud resources to balance latency, cost, and privacy. Among the highlighted papers, EdgeMM presents a multi-core CPU design with heterogeneous AI extensions and activation-aware weight pruning, reporting a 2.84x speedup over laptop GPUs. Palladium introduces a DPU-enabled serverless data plane that offloads work from the CPU and enables zero-copy communication, reporting a 20.9x improvement in requests per second (RPS) and a 21x reduction in latency.
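To make the pruning idea concrete, the sketch below shows activation-aware weight pruning in the style of calibration-based methods: each weight is scored by its magnitude times the norm of the input activation it multiplies, and the lowest-scoring weights in each output row are zeroed. The function name, scoring rule, and shapes are illustrative assumptions for a minimal example, not EdgeMM's actual implementation.

```python
import numpy as np

def activation_aware_prune(weight: np.ndarray,
                           activations: np.ndarray,
                           sparsity: float = 0.5) -> np.ndarray:
    """Zero out the lowest-scoring weights of a linear layer (illustrative sketch).

    weight:      (out_features, in_features) matrix of a linear layer.
    activations: (num_samples, in_features) calibration inputs to that layer.
    sparsity:    fraction of weights to remove in each output row.
    """
    # Estimate the importance of each input feature from calibration data.
    act_norm = np.linalg.norm(activations, axis=0)        # (in_features,)

    # Score each weight by its magnitude times the activation it multiplies,
    # so small weights that feed large activations are still kept.
    scores = np.abs(weight) * act_norm                     # broadcasts over rows

    # Per-row threshold: drop the lowest `sparsity` fraction of scores.
    k = int(weight.shape[1] * sparsity)
    pruned = weight.copy()
    if k > 0:
        weakest = np.argpartition(scores, k, axis=1)[:, :k]
        np.put_along_axis(pruned, weakest, 0.0, axis=1)
    return pruned

# Example: prune half the weights of a toy 4x8 layer using random calibration data.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
X = rng.normal(size=(64, 8))
W_sparse = activation_aware_prune(W, X, sparsity=0.5)
print(f"zeroed fraction: {np.mean(W_sparse == 0):.2f}")
```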
Sources
EdgeMM: Multi-Core CPU with Heterogeneous AI-Extension and Activation-aware Weight Pruning for Multimodal LLMs at Edge
Through a Compressed Lens: Investigating the Impact of Quantization on LLM Explainability and Interpretability