The field of large language models (LLMs) is rapidly evolving, with a strong focus on improving inference and training efficiency. Recent work has centered on addressing the scalability limits of traditional GPU-centric architectures, exploring novel memory orchestration techniques, and designing software-defined architectures for heterogeneous and legacy GPUs. Notable advances include disaggregated AI infrastructure platforms, state-preserving elastic tensor parallelism, and disk-aware KV cache offloading, which have delivered significant improvements in performance, efficiency, and cost-effectiveness.
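To make the disk-aware KV cache offloading idea concrete, the minimal sketch below keeps recently used KV blocks in host memory and spills the least-recently-used blocks to disk, reloading them on demand. The class, file layout, and eviction policy are illustrative assumptions, not the API of any specific system mentioned above.

```python
# Minimal sketch of disk-aware KV cache offloading (illustrative only; the
# class name, file layout, and LRU policy are assumptions for this example).
import os
import numpy as np

class OffloadingKVCache:
    """Keeps hot KV blocks in host memory and spills cold ones to disk."""

    def __init__(self, spill_dir: str, max_resident_blocks: int = 64):
        self.spill_dir = spill_dir
        self.max_resident_blocks = max_resident_blocks
        self.resident = {}   # block_id -> np.ndarray held in RAM
        self.lru = []        # block_ids ordered from coldest to hottest
        os.makedirs(spill_dir, exist_ok=True)

    def put(self, block_id: str, kv_block: np.ndarray) -> None:
        # Insert a freshly computed KV block and evict cold blocks if needed.
        self.resident[block_id] = kv_block
        self._touch(block_id)
        self._evict_if_needed()

    def get(self, block_id: str) -> np.ndarray:
        if block_id not in self.resident:
            # Cache miss: reload the spilled block from disk.
            path = os.path.join(self.spill_dir, f"{block_id}.npy")
            self.resident[block_id] = np.load(path)
        self._touch(block_id)
        self._evict_if_needed()
        return self.resident[block_id]

    def _touch(self, block_id: str) -> None:
        # Move the block to the hot end of the LRU order.
        if block_id in self.lru:
            self.lru.remove(block_id)
        self.lru.append(block_id)

    def _evict_if_needed(self) -> None:
        # Spill the coldest blocks to disk until we fit the memory budget.
        while len(self.resident) > self.max_resident_blocks:
            victim = self.lru.pop(0)
            block = self.resident.pop(victim)
            np.save(os.path.join(self.spill_dir, f"{victim}.npy"), block)
```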
Particularly noteworthy papers include FengHuang, which proposes a novel AI infrastructure design that achieves up to a 93% reduction in local memory capacity and up to 50% savings in GPU compute. AnchorTP is also notable for its state-preserving elastic tensor parallelism (TP) framework, which reduces Time to First Success by up to 11x and Time to Peak by up to 59%; a sketch of the resharding idea follows below. Additionally, QUILL introduces a schedule-aware accelerator that turns deformable attention into cache-friendly, single-pass work, achieving up to 7.29x higher throughput and 47.3x better energy efficiency than an RTX 4090.
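As a rough illustration of what state-preserving elastic TP involves, the sketch below reshards a column-parallel weight when the TP degree changes by merging the old shards and re-splitting them, so existing weights (and any per-column optimizer state handled the same way) carry over instead of being reinitialized. The function name and resharding strategy are assumptions for illustration, not AnchorTP's actual mechanism.

```python
# Minimal sketch of state-preserving resharding for elastic tensor parallelism
# (an illustrative assumption of how shards could be merged and re-split when
# the TP degree changes; not AnchorTP's actual implementation).
import numpy as np

def reshard_column_parallel(shards: list, new_tp_degree: int) -> list:
    """Merge column-parallel shards from the old TP group and re-split them
    for the new TP degree, preserving the existing parameter values."""
    full_weight = np.concatenate(shards, axis=1)          # undo the old column split
    assert full_weight.shape[1] % new_tp_degree == 0, "columns must divide evenly"
    return np.split(full_weight, new_tp_degree, axis=1)   # new column shards

# Example: shrink from TP=4 to TP=2 after losing two GPUs.
old_shards = [np.random.randn(8, 4) for _ in range(4)]    # 4 shards of an 8x16 weight
new_shards = reshard_column_parallel(old_shards, new_tp_degree=2)
assert new_shards[0].shape == (8, 8)
```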