Advancements in GPU Architecture and Optimization

The field of GPU research is moving toward optimizing performance, improving efficiency, and increasing scalability. Recent developments focus on raising GPU utilization, reducing latency, and accelerating workloads such as machine learning, linear algebra, and data search, driven by innovations in GPU architecture, memory management, and parallelization techniques. Notable papers include:

Decoupled Control Flow and Data Access in RISC-V GPGPUs introduces a hardware control-flow (CF) manager and decoupled memory streaming lanes to improve performance.

Heimdall++ optimizes GPU utilization and pipeline parallelism for efficient single-pulse detection.

Tangram accelerates serverless LLM loading through GPU memory reuse and affinity.

Fantasy enables efficient large-scale vector search on GPU clusters with GPUDirect Async.

Trinity disaggregates vector search from prefill-decode disaggregation in LLM serving.

TokenScale introduces a timely and accurate autoscaling framework for disaggregated LLM serving based on token velocity.

Together, these papers demonstrate significant improvements in performance, efficiency, and scalability, and they are expected to have a substantial impact on the field.
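The memory-reuse-and-affinity idea behind Tangram — routing a serverless LLM request to a GPU that already holds the model's weights so the load can be skipped — can be illustrated with a minimal sketch. This is a toy placement policy under assumed names and numbers, not Tangram's actual algorithm:

```python
class GpuModelCache:
    """Toy affinity-based placement: prefer a GPU that already holds the
    requested model's weights; otherwise load onto the GPU with the most
    free memory. Illustrative only, not Tangram's algorithm."""

    def __init__(self, gpu_mem_gb):
        self.free = dict(gpu_mem_gb)                     # gpu id -> free memory (GB)
        self.resident = {g: set() for g in gpu_mem_gb}   # gpu id -> resident models

    def place(self, model, size_gb):
        # Affinity hit: the weights are already resident, so no load is needed.
        for gpu, models in self.resident.items():
            if model in models:
                return gpu, "reused"
        # Miss: load the weights onto the GPU with the most free memory.
        gpu = max(self.free, key=self.free.get)
        if self.free[gpu] < size_gb:
            raise MemoryError("no GPU has room for the model")
        self.free[gpu] -= size_gb
        self.resident[gpu].add(model)
        return gpu, "loaded"
```

A second request for the same model then hits the affinity path and avoids the load entirely, which is the latency win the paper targets.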
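TokenScale's "token velocity" signal — scaling on the rate of token generation rather than on request counts — can likewise be sketched in a few lines. All names, thresholds, and the per-replica throughput figure here are hypothetical assumptions for illustration, not details from the paper:

```python
import math
import time
from collections import deque


class TokenVelocityScaler:
    """Toy autoscaler: tracks tokens generated over a sliding window and
    recommends a replica count from the resulting tokens/sec velocity.
    Illustrative only; not TokenScale's actual policy."""

    def __init__(self, tokens_per_replica=1000.0, window_s=10.0, min_replicas=1):
        self.tokens_per_replica = tokens_per_replica  # assumed per-replica throughput
        self.window_s = window_s
        self.min_replicas = min_replicas
        self.events = deque()  # (timestamp, token_count)

    def record(self, token_count, now=None):
        now = time.monotonic() if now is None else now
        self.events.append((now, token_count))
        self._evict(now)

    def velocity(self, now=None):
        """Tokens per second over the sliding window."""
        now = time.monotonic() if now is None else now
        self._evict(now)
        return sum(n for _, n in self.events) / self.window_s

    def recommended_replicas(self, now=None):
        return max(self.min_replicas,
                   math.ceil(self.velocity(now) / self.tokens_per_replica))

    def _evict(self, now):
        while self.events and self.events[0][0] < now - self.window_s:
            self.events.popleft()
```

Because velocity reacts as soon as decode output ramps up, rather than waiting for request counts or queue lengths to move, a signal of this shape can scale disaggregated prefill and decode pools more promptly.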

Sources

Decoupled Control Flow and Data Access in RISC-V GPGPUs

Heimdall++: Optimizing GPU Utilization and Pipeline Parallelism for Efficient Single-Pulse Detection

Tangram: Accelerating Serverless LLM Loading through GPU Memory Reuse and Affinity

Microbenchmarking NVIDIA's Blackwell Architecture: An in-depth Architectural Analysis

Fantasy: Efficient Large-scale Vector Search on GPU Clusters with GPUDirect Async

Trinity: Disaggregating Vector Search from Prefill-Decode Disaggregation in LLM Serving

Pushing Tensor Accelerators Beyond MatMul in a User-Schedulable Language

TokenScale: Timely and Accurate Autoscaling for Disaggregated LLM Serving with Token Velocity
