Accelerating Compute-Intensive Operations with Specialized Hardware

The field of high-performance computing is seeing significant advances driven by specialized hardware units, such as Sparse Tensor Cores and CUDA cores, which are increasingly leveraged to accelerate compute-intensive operations. Current research focuses on optimizing operations such as stencil computations, dense matrix multiplications, and sparse matrix multiplications to fully exploit these units. Novel approaches, including synergistic computation across different core types and massively parallel algorithms, are being explored to achieve superior performance. These advances have far-reaching implications for scientific computing, artificial intelligence, and graphics processing.

Noteworthy papers include:

- SPTCStencil introduces a sparse computation paradigm that unlocks Sparse Tensor Cores for stencil computations, achieving an average speedup of 5.46x (see the stencil-as-SpMV sketch after this list).
- MCFuser presents a framework for generating high-performance fused kernels for memory-bound, compute-intensive operators, delivering up to a 5.9x speedup in kernel performance.
- Libra proposes a systematic approach to synergizing CUDA and Tensor cores for sparse matrix multiplication, achieving an average speedup of 3.1x (see the hybrid SpMM sketch below).
- TriADA introduces a massively parallel trilinear matrix-by-tensor multiply-add algorithm and device architecture, capable of performing trilinear transformations of hypercubic arithmetic complexity in a linear number of time steps.
- SparStencil retargets sparse tensor core units (TCUs) to scientific stencil computations via structured sparsity transformation, achieving up to a 7.1x speedup over state-of-the-art frameworks.
- AIRES accelerates out-of-core graph convolutional networks (GCNs) through algorithm-system co-design, achieving up to 1.8x lower latency on real-world graph processing benchmarks.
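
The common idea behind SPTCStencil and SparStencil is to recast a stencil sweep as a structured sparse matrix-vector product, so that sparse matrix hardware can execute it. The following is a minimal CPU-side sketch of that recasting using NumPy/SciPy; the function names and the 1D 3-point stencil are illustrative assumptions, not code from either paper, and the papers' actual contribution is mapping the resulting structured sparsity onto Sparse Tensor Cores.

```python
# Illustrative sketch: express a stencil sweep as a sparse matrix-vector
# product. This is NOT code from SPTCStencil or SparStencil; it only shows
# the recasting that makes sparse matrix hardware applicable.
import numpy as np
import scipy.sparse as sp

def stencil_as_sparse_matrix(n, coeffs=(1.0, -2.0, 1.0)):
    """Build the sparse operator for a 1D 3-point stencil on n points.

    Row i holds the stencil coefficients for output point i, so applying
    the stencil is exactly a sparse matrix-vector product. Boundary rows
    simply drop the out-of-range neighbours here.
    """
    left, center, right = coeffs
    return sp.diags([left, center, right], offsets=[-1, 0, 1],
                    shape=(n, n), format="csr")

def stencil_direct(u, coeffs=(1.0, -2.0, 1.0)):
    """Reference loop-free implementation of the same 3-point stencil."""
    left, center, right = coeffs
    out = center * u.copy()
    out[1:] += left * u[:-1]    # contribution of the left neighbour
    out[:-1] += right * u[1:]   # contribution of the right neighbour
    return out

n = 1024
u = np.random.rand(n)
A = stencil_as_sparse_matrix(n)

# The sparse product and the hand-written sweep agree.
assert np.allclose(A @ u, stencil_direct(u))
```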
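
Libra's stated idea is to split sparse matrix multiplication between Tensor cores, which favor dense tiles, and CUDA cores, which handle scattered nonzeros. The sketch below mimics that decomposition on the CPU under loudly labeled assumptions: the routing heuristic, tile size, and density cutoff are arbitrary illustrative choices, and dense GEMM and generic sparse multiply stand in for the two hardware paths; none of this reproduces Libra's actual algorithm.

```python
# Illustrative sketch (assumptions, not Libra's algorithm): route each row
# tile of a sparse matrix to a dense-GEMM path (Tensor core stand-in) or a
# generic sparse path (CUDA core stand-in) based on nonzero density.
import numpy as np
import scipy.sparse as sp

def hybrid_spmm(A, B, tile=32, density_cutoff=0.5):
    """Multiply sparse A (CSR) by dense B, choosing a dense or sparse
    execution path per row tile of A. Thresholds are illustrative."""
    m = A.shape[0]
    C = np.zeros((m, B.shape[1]))
    for start in range(0, m, tile):
        stop = min(start + tile, m)
        block = A[start:stop]
        density = block.nnz / (block.shape[0] * block.shape[1])
        if density >= density_cutoff:
            # Dense path stand-in: densify the tile and use GEMM.
            C[start:stop] = block.toarray() @ B
        else:
            # Sparse path stand-in: keep the tile in CSR form.
            C[start:stop] = block @ B
    return C

A = sp.random(256, 256, density=0.05, format="csr")
B = np.random.rand(256, 64)

# Cutoff chosen near the matrix density so both paths fire in this demo.
assert np.allclose(hybrid_spmm(A, B, density_cutoff=0.05), A @ B)
```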

Sources

SPTCStencil: Unleashing Sparse Tensor Cores for Stencil Computation via Strided Swap

Exploring Commutative Matrix Multiplication Schemes via Flip Graphs

MCFuser: High-Performance and Rapid Fusion of Memory-Bound Compute-Intensive Operators

Libra: Synergizing CUDA and Tensor Cores for High-Performance Sparse Matrix Multiplication

TriADA: Massively Parallel Trilinear Matrix-by-Tensor Multiply-Add Algorithm and Device Architecture for the Acceleration of 3D Discrete Transformations

SparStencil: Retargeting Sparse Tensor Cores to Scientific Stencil Computations via Structured Sparsity Transformation

Localized evaluation and fast summation in the extrapolated regularization method for integrals in Stokes flow

Anatomy of High-Performance Column-Pivoted QR Decomposition

PyTorch-based Geometric Learning with Non-CUDA Processing Units: Experiences from Intel Gaudi-v2 HPUs

AIRES: Accelerating Out-of-Core GCNs via Algorithm-System Co-Design

Hardware-Accelerated Algorithm for Complex Function Roots Density Graph Plotting
