The field of large language models (LLMs) is rapidly evolving, with a strong focus on improving inference and training efficiency. Recent work has centered on addressing the scalability limits of traditional GPU-centric architectures, exploring novel memory orchestration techniques, and designing software-defined architectures for heterogeneous and legacy GPUs. Notable advances include disaggregated AI infrastructure platforms, state-preserving elastic tensor parallelism, and disk-aware KV cache offloading, which have delivered significant improvements in performance, efficiency, and cost-effectiveness.
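To make the disk-aware KV cache offloading idea concrete, the minimal sketch below keeps recently used KV blocks in host memory and spills the least-recently-used blocks to disk, reloading them on demand. The class, file layout, and eviction policy are illustrative assumptions, not the API of any specific system mentioned above.

```python
# Minimal sketch of disk-aware KV cache offloading (illustrative only; the
# class name, file layout, and LRU policy are assumptions for this example).
import os
import numpy as np

class OffloadingKVCache:
    """Keeps hot KV blocks in host memory and spills cold ones to disk."""

    def __init__(self, spill_dir: str, max_resident_blocks: int = 64):
        self.spill_dir = spill_dir
        self.max_resident_blocks = max_resident_blocks
        self.resident = {}   # block_id -> np.ndarray held in RAM
        self.lru = []        # block_ids ordered from coldest to hottest
        os.makedirs(spill_dir, exist_ok=True)

    def put(self, block_id: str, kv_block: np.ndarray) -> None:
        # Insert a freshly computed KV block and evict cold blocks if needed.
        self.resident[block_id] = kv_block
        self._touch(block_id)
        self._evict_if_needed()

    def get(self, block_id: str) -> np.ndarray:
        if block_id not in self.resident:
            # Cache miss: reload the spilled block from disk.
            path = os.path.join(self.spill_dir, f"{block_id}.npy")
            self.resident[block_id] = np.load(path)
        self._touch(block_id)
        self._evict_if_needed()
        return self.resident[block_id]

    def _touch(self, block_id: str) -> None:
        # Move the block to the hot end of the LRU order.
        if block_id in self.lru:
            self.lru.remove(block_id)
        self.lru.append(block_id)

    def _evict_if_needed(self) -> None:
        # Spill the coldest blocks to disk until we fit the memory budget.
        while len(self.resident) > self.max_resident_blocks:
            victim = self.lru.pop(0)
            block = self.resident.pop(victim)
            np.save(os.path.join(self.spill_dir, f"{victim}.npy"), block)
```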
Particularly noteworthy papers include FengHuang, which proposes a novel AI infrastructure design that achieves up to a 93% reduction in local memory capacity and up to 50% savings in GPU compute. AnchorTP is also notable for its state-preserving elastic tensor parallelism (TP) framework, which reduces Time to First Success by up to 11x and Time to Peak by up to 59%; a sketch of the resharding idea follows below. Additionally, QUILL introduces a schedule-aware accelerator that turns deformable attention into cache-friendly, single-pass work, achieving up to 7.29x higher throughput and 47.3x better energy efficiency than an RTX 4090.
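As a rough illustration of what state-preserving elastic TP involves, the sketch below reshards a column-parallel weight when the TP degree changes by merging the old shards and re-splitting them, so existing weights (and any per-column optimizer state handled the same way) carry over instead of being reinitialized. The function name and resharding strategy are assumptions for illustration, not AnchorTP's actual mechanism.

```python
# Minimal sketch of state-preserving resharding for elastic tensor parallelism
# (an illustrative assumption of how shards could be merged and re-split when
# the TP degree changes; not AnchorTP's actual implementation).
import numpy as np

def reshard_column_parallel(shards: list, new_tp_degree: int) -> list:
    """Merge column-parallel shards from the old TP group and re-split them
    for the new TP degree, preserving the existing parameter values."""
    full_weight = np.concatenate(shards, axis=1)          # undo the old column split
    assert full_weight.shape[1] % new_tp_degree == 0, "columns must divide evenly"
    return np.split(full_weight, new_tp_degree, axis=1)   # new column shards

# Example: shrink from TP=4 to TP=2 after losing two GPUs.
old_shards = [np.random.randn(8, 4) for _ in range(4)]    # 4 shards of an 8x16 weight
new_shards = reshard_column_parallel(old_shards, new_tp_degree=2)
assert new_shards[0].shape == (8, 8)
```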