High-performance computing is advancing as specialized hardware units, notably Tensor Cores and Sparse Tensor Cores, are used together with general-purpose CUDA cores to accelerate compute-intensive operations. Current research targets stencil computations, dense matrix multiplication, and sparse matrix multiplication, reformulating each so that it fully exploits these units. Approaches under exploration include synergistic computation that divides work between CUDA cores and Tensor Cores, automatic generation of fused kernels, and massively parallel algorithms paired with dedicated device architectures. These advances have implications for scientific computing, artificial intelligence, and graphics processing.

Noteworthy papers include:
SPTCStencil introduces a sparse computation paradigm that unlocks Sparse Tensor Cores for stencil computations, achieving an average speedup of 5.46x.
MCFuser presents a framework for generating high-performance fused kernels, achieving up to a 5.9x kernel speedup.
Libra proposes a systematic approach to combining CUDA cores and Tensor Cores for sparse matrix multiplication, achieving an average speedup of 3.1x.
TriADA introduces a massively parallel trilinear matrix-by-tensor multiply-add algorithm and device architecture that performs trilinear transformations of hypercubic arithmetic complexity in a linear number of time steps.
SparStencil retargets Sparse Tensor Core Units (TCUs) to scientific stencil computations through a structured-sparsity transformation, achieving up to a 7.1x speedup over state-of-the-art frameworks.
AIRES accelerates out-of-core graph convolutional networks (GCNs) through algorithm-system co-design, achieving up to 1.8x lower latency on real-world graph processing benchmarks.
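SPTCStencil and SparStencil both rest on rewriting a stencil as a sparse matrix product that Sparse Tensor Cores can execute. On NVIDIA Ampere and later GPUs, those cores accept a sparse operand only in the 2:4 structured pattern: at most two nonzeros in every aligned group of four consecutive elements along the reduction dimension. The plain-Python sketch below, assuming a 1D 3-point stencil, shows why some transformation is needed at all: the stencil written as a banded matrix violates 2:4, while splitting it into two matrices makes each one compliant. The outer/center split and every function name here are hypothetical illustrations, not the transformation used by SPTCStencil or SparStencil.

```python
# Illustrative sketch only: function names and the outer/center split are
# hypothetical, not the transformation used by SPTCStencil or SparStencil.


def stencil_direct(u, w):
    """Apply out[i] = w[0]*u[i-1] + w[1]*u[i] + w[2]*u[i+1] at interior points."""
    return [w[0] * u[i - 1] + w[1] * u[i] + w[2] * u[i + 1]
            for i in range(1, len(u) - 1)]


def stencil_matrix(n, w):
    """Banded (n-2) x n matrix A such that A @ u equals stencil_direct(u, w)."""
    a = [[0.0] * n for _ in range(n - 2)]
    for r in range(n - 2):
        a[r][r], a[r][r + 1], a[r][r + 2] = w[0], w[1], w[2]
    return a


def split_stencil_matrix(n, w):
    """Hypothetical split of A into an 'outer' part (w[0], w[2]) and a 'center' part (w[1])."""
    outer = [[0.0] * n for _ in range(n - 2)]
    center = [[0.0] * n for _ in range(n - 2)]
    for r in range(n - 2):
        outer[r][r], outer[r][r + 2] = w[0], w[2]
        center[r][r + 1] = w[1]
    return outer, center


def matvec(a, u):
    """Plain row-by-row matrix-vector product, a stand-in for the hardware matrix multiply."""
    return [sum(x * y for x, y in zip(row, u)) for row in a]


def is_2_to_4_sparse(a):
    """True if each row has at most 2 nonzeros in every aligned group of 4 columns."""
    for row in a:
        for g in range(0, len(row), 4):
            if sum(1 for x in row[g:g + 4] if x != 0) > 2:
                return False
    return True


if __name__ == "__main__":
    n = 8
    w = [1.0, -2.0, 1.0]                  # second-difference weights
    u = [float(i * i) for i in range(n)]  # u[i] = i^2, so every output is 2.0

    a = stencil_matrix(n, w)
    print(stencil_direct(u, w))                  # [2.0, 2.0, 2.0, 2.0, 2.0, 2.0]
    print(matvec(a, u) == stencil_direct(u, w))  # True
    print(is_2_to_4_sparse(a))                   # False: row 0 has 3 nonzeros in columns 0-3

    outer, center = split_stencil_matrix(n, w)
    print(is_2_to_4_sparse(outer), is_2_to_4_sparse(center))  # True True
    summed = [x + y for x, y in zip(matvec(outer, u), matvec(center, u))]
    print(summed == stencil_direct(u, w))        # True
```

The split trades one non-compliant product for two compliant ones. It is only meant to make the 2:4 constraint concrete, not to suggest how the cited papers handle that cost.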
Accelerating Compute-Intensive Operations with Specialized Hardware
Sources
TriADA: Massively Parallel Trilinear Matrix-by-Tensor Multiply-Add Algorithm and Device Architecture for the Acceleration of 3D Discrete Transformations
SparStencil: Retargeting Sparse Tensor Cores to Scientific Stencil Computations via Structured Sparsity Transformation
Localized evaluation and fast summation in the extrapolated regularization method for integrals in Stokes flow