Advances in Efficient Large Language Model Inference

Research on large language model (LLM) inference is advancing rapidly, with new architectures and techniques aimed at improving efficiency, accuracy, and latency. One line of work optimizes inference for specific hardware platforms such as GPUs and CPUs; another develops algorithms that cut the computational overhead of inference itself, for example sparse activation and dynamic token halting. Together, these advances could enable broader adoption of LLMs across applications ranging from natural language processing to recommender systems. Notable papers include DCN^2, which introduces algorithmic improvements to the DCNv2 architecture for large-scale recommendation, and QuickSilver, which enables semantic adaptivity at inference time without altering model weights or structure. Agent.xpu and LLM-Mesh, meanwhile, demonstrate optimized serving systems and serverless inference schemes for LLM workloads.
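
To make the halting idea concrete, below is a minimal sketch of per-token dynamic halting, assuming a PyTorch-style Transformer stack: a token stops being updated once its hidden state changes little between consecutive layers. The class name, threshold, and convergence criterion are illustrative assumptions, not QuickSilver's actual mechanism; a production implementation would gather only the still-active tokens (and skip their KV updates) to realize the compute savings.

```python
# Minimal sketch of per-token dynamic halting (illustrative assumptions only).
import copy

import torch
import torch.nn as nn


class HaltingEncoder(nn.Module):
    """Toy Transformer stack that freezes tokens whose states have converged."""

    def __init__(self, d_model=256, n_heads=4, n_layers=6, halt_threshold=1e-2):
        super().__init__()
        base = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.layers = nn.ModuleList(copy.deepcopy(base) for _ in range(n_layers))
        self.halt_threshold = halt_threshold

    def forward(self, x):
        # x: (batch, seq_len, d_model); `active` marks tokens still being refined.
        active = torch.ones(x.shape[:2], dtype=torch.bool, device=x.device)
        for layer in self.layers:
            if not active.any():
                break  # every token has halted; skip the remaining layers
            updated = layer(x)  # a real system would gather only active tokens here
            # Relative change of each token's hidden state across this layer.
            delta = (updated - x).norm(dim=-1) / (x.norm(dim=-1) + 1e-6)
            # Halted tokens keep their previous representation.
            x = torch.where(active.unsqueeze(-1), updated, x)
            # Tokens whose change fell below the threshold stop being updated.
            active = active & (delta > self.halt_threshold)
        return x


if __name__ == "__main__":
    model = HaltingEncoder().eval()
    with torch.no_grad():
        hidden = torch.randn(2, 16, 256)
        out = model(hidden)
    print(out.shape)  # torch.Size([2, 16, 256])
```
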
Sources
DCN^2: Interplay of Implicit Collision Weights and Explicit Cross Layers for Large-Scale Recommendation
QuickSilver -- Speeding up LLM Inference through Dynamic Token Halting, KV Skipping, Contextual Token Fusion, and Adaptive Matryoshka Quantization
VEDA: Efficient LLM Generation Through Voting-based KV Cache Eviction and Dataflow-flexible Accelerator