Optimizing Large Language Model Inference

The field of large language model inference is moving towards better performance, energy efficiency, and scalability. Recent developments focus on improving cache arbitration, throttling, and memory system designs to address the substantial memory requirements of large language models. Researchers are also exploring alternative architectures, such as coarse-grained reconfigurable arrays (CGRAs), that trade off energy efficiency against programmability. There is also growing interest in disaggregating sampling from GPU inference and in systematic, system-level characterization to identify performance bottlenecks. Noteworthy papers include LLaMCAT, which achieves an average speedup of 1.26x through cache arbitration and throttling, and SIMPLE, which improves end-to-end throughput by up to 96% by moving sampling into a separate decision plane. Other notable works include Leveraging Recurrent Patterns in Graph Accelerators, which proposes a graph processing method that minimizes memristor write operations, and AugServe, an efficient inference framework that reduces queueing latency and improves effective throughput for augmented LLM inference serving.
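
To make the disaggregation idea concrete, the sketch below separates token sampling (a lightweight "decision plane") from the model forward pass (the GPU "inference plane") so the two can be scheduled independently. This is an illustrative toy under assumed names and parameters, not SIMPLE's actual architecture or API: the queue-based wiring, the `inference_plane`/`decision_plane` functions, and the temperature/top-k settings are all hypothetical.

```python
# Illustrative sketch only: all names and structure are assumptions, not SIMPLE's design.
# The model forward pass and the sampling step run in separate workers connected by queues,
# so sampling policy changes never touch the GPU loop.
import numpy as np
from queue import Queue
from threading import Thread

VOCAB_SIZE = 32_000

def inference_plane(requests: Queue, logits_out: Queue) -> None:
    """Stand-in for the GPU forward pass: emits one logits vector per request."""
    while True:
        req_id = requests.get()
        if req_id is None:                    # shutdown signal
            logits_out.put(None)
            break
        logits = np.random.randn(VOCAB_SIZE).astype(np.float32)  # fake model output
        logits_out.put((req_id, logits))

def decision_plane(logits_in: Queue, temperature: float = 0.8, top_k: int = 50) -> None:
    """Sampling runs off the GPU: temperature scaling + top-k over received logits."""
    rng = np.random.default_rng(0)
    while True:
        item = logits_in.get()
        if item is None:
            break
        req_id, logits = item
        scaled = logits / temperature
        top = np.argpartition(scaled, -top_k)[-top_k:]           # candidate token ids
        probs = np.exp(scaled[top] - scaled[top].max())
        probs /= probs.sum()
        token = int(rng.choice(top, p=probs))
        print(f"request {req_id}: sampled token {token}")

if __name__ == "__main__":
    reqs, logits_q = Queue(), Queue()
    workers = [Thread(target=inference_plane, args=(reqs, logits_q)),
               Thread(target=decision_plane, args=(logits_q,))]
    for w in workers:
        w.start()
    for i in range(4):
        reqs.put(i)
    reqs.put(None)
    for w in workers:
        w.join()
```

The design point this toy illustrates is the separation of concerns: the GPU worker only produces logits, while sampling policy lives entirely in the decision plane and can be scaled or tuned without touching the inference loop.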

Sources

LLaMCAT: Optimizing Large Language Model Inference with Cache Arbitration and Throttling

Efficient Kernel Mapping and Comprehensive System Evaluation of LLM Acceleration on a CGLA

SIMPLE: Disaggregating Sampling from GPU Inference into a Decision Plane for Faster Distributed LLM Serving

Leveraging Recurrent Patterns in Graph Accelerators

RoMe: Row Granularity Access Memory System for Large Language Models

A Systematic Characterization of LLM Inference on GPUs

AugServe: Adaptive Request Scheduling for Augmented Large Language Model Inference Serving
