Advancements in Large Language Model Efficiency and Scalability

The field of large language models (LLMs) is evolving rapidly, with a strong focus on improving efficiency and scalability. Recent developments center on optimizing cache management, reducing memory bottlenecks, and enhancing parallelism. Researchers are exploring innovative architectures, such as chiplet-based designs and silicon photonic interconnected chiplets, to accelerate LLM inference. There is also a growing emphasis on software-hardware co-design, with techniques such as IR-drop mitigation and dynamic adjustment mechanisms proposed to improve energy efficiency and performance.

Noteworthy papers in this area include Synergistic Tensor and Pipeline Parallelism, which proposes a new scheduling method that reduces bubbles when tensor and pipeline parallelism are combined, yielding up to 12% and 16% improvements in training throughput for LLMs and MLLMs, respectively, and PICNIC: Silicon Photonic Interconnected Chiplets with Computational Network and In-memory Computing for LLM Inference Acceleration, which presents a 3D-stacked, chiplet-based LLM inference accelerator achieving a 3.95x speedup and a 30x efficiency improvement over the Nvidia A100.
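As a rough illustration of why reducing pipeline bubbles matters, the sketch below estimates the idle-time fraction of a plain synchronous GPipe-style pipeline schedule with p stages and m micro-batches. The formula, function name, and parameter values are illustrative assumptions and are not taken from the Synergistic Tensor and Pipeline Parallelism paper, which proposes a more sophisticated schedule that overlaps tensor- and pipeline-parallel work.

```python
def pipeline_bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Idle-time fraction of a synchronous GPipe-style pipeline schedule.

    With p stages and m micro-batches, each stage is idle for (p - 1)
    micro-batch slots while the pipeline fills and drains, out of
    (m + p - 1) total slots, giving a bubble fraction of
    (p - 1) / (m + p - 1).
    """
    p, m = num_stages, num_microbatches
    return (p - 1) / (m + p - 1)


if __name__ == "__main__":
    # More micro-batches shrink the bubble, which is why schedules that
    # hide fill/drain idle time can translate into the throughput gains
    # reported for combined tensor and pipeline parallelism.
    for m in (4, 8, 16, 32):
        frac = pipeline_bubble_fraction(num_stages=8, num_microbatches=m)
        print(f"stages=8, microbatches={m:2d} -> bubble fraction {frac:.2%}")
```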
Sources
Scalable Processing-Near-Memory for 1M-Token LLM Inference: CXL-Enabled KV-Cache Management Beyond GPU Limits
PICNIC: Silicon Photonic Interconnected Chiplets with Computational Network and In-memory Computing for LLM Inference Acceleration