Advancements in Large Language Model Efficiency and Scalability

The field of large language models (LLMs) is evolving rapidly, with a strong focus on improving efficiency and scalability. Recent developments have centered on optimizing cache management, reducing memory bottlenecks, and enhancing parallelism. Researchers are exploring innovative architectures, such as chiplet-based designs and silicon photonic interconnected chiplets, to accelerate LLM inference. There is also a growing emphasis on software-hardware co-design, with techniques like IR-drop mitigation and dynamic adjustment mechanisms being proposed to improve energy efficiency and performance.

Noteworthy papers in this area include:

Synergistic Tensor and Pipeline Parallelism, which proposes a new scheduling method that reduces bubbles across tensor and pipeline parallelism, yielding up to 12% and 16% improvements in training throughput for LLMs and MLLMs, respectively.

PICNIC: Silicon Photonic Interconnected Chiplets with Computational Network and In-memory Computing for LLM Inference Acceleration, which presents a 3D-stacked chiplet-based LLM inference accelerator achieving a 3.95x speedup and 30x efficiency improvement over the Nvidia A100.
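To give a feel for the "bubble" overhead that pipeline-parallel scheduling tries to eliminate, the sketch below computes the idle fraction of a standard synchronous pipeline schedule (GPipe/1F1B-style), where p stages process m microbatches per step. This is a generic textbook estimate, not the scheduling method of the Synergistic Tensor and Pipeline Parallelism paper; the function and parameter names are illustrative.

```python
# Minimal sketch: idle ("bubble") fraction of a synchronous pipeline schedule.
# For p pipeline stages and m microbatches, the standard estimate is
# bubble = (p - 1) / (m + p - 1); larger m amortizes the fill/drain phases.

def pipeline_bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Return the fraction of a training step spent idle due to pipeline fill/drain."""
    p, m = num_stages, num_microbatches
    return (p - 1) / (m + p - 1)

if __name__ == "__main__":
    for m in (4, 16, 64):
        frac = pipeline_bubble_fraction(num_stages=8, num_microbatches=m)
        print(f"stages=8, microbatches={m}: bubble fraction = {frac:.1%}")
```

With 8 stages, the bubble drops from roughly 39% at 4 microbatches to about 9% at 64, which is why schedulers that overlap or reorder tensor- and pipeline-parallel work can recover meaningful throughput even when the microbatch count is limited.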

Sources

Category-Aware Semantic Caching for Heterogeneous LLM Workloads

Can machines think efficiently?

Choreographer: A Full-System Framework for Fine-Grained Tasks in Cache Hierarchies

Synergistic Tensor and Pipeline Parallelism

AMD MI300X GPU Performance Analysis

RDMA Point-to-Point Communication for LLM Systems

Scalable Processing-Near-Memory for 1M-Token LLM Inference: CXL-Enabled KV-Cache Management Beyond GPU Limits

A CPU-Centric Perspective on Agentic AI

FlexiCache: Leveraging Temporal Stability of Attention Heads for Efficient KV Cache Management

Simulation-Driven Evaluation of Chiplet-Based Architectures Using VisualSim

KV Cache Transform Coding for Compact Storage in LLM Inference

Optimizing Attention on GPUs by Exploiting GPU Architectural NUMA Effects

Eliminating Multi-GPU Performance Taxes: A Systems Approach to Efficient Distributed LLMs

Continuum: Efficient and Robust Multi-Turn LLM Agent Scheduling with KV Cache Time-to-Live

From Minutes to Seconds: Redefining the Five-Minute Rule for AI-Era Memory Hierarchies

PICNIC: Silicon Photonic Interconnected Chiplets with Computational Network and In-memory Computing for LLM Inference Acceleration

Implementation of transformer-based LLMs with large-scale optoelectronic neurons on a CMOS image sensor platform

AIM: Software and Hardware Co-design for Architecture-level IR-drop Mitigation in High-performance PIM
