The field of efficient computing for AI and edge devices is evolving rapidly, with a focus on raising throughput, reducing latency, and improving energy efficiency. Recent work centers on architectures, scheduling techniques, and compilation frameworks that speed up inference on complex workloads without sacrificing accuracy. Notably, researchers have explored sparse and operator-aware hybrid scheduling, on-demand multi-task sparsity, and associative memory-based architectures to accelerate deep neural network inference. There has also been a push toward compiler tooling and performance profiling techniques that expose and help optimize the behavior of accelerator compilers. Together, these advances promise substantially more efficient AI and edge computing applications.
Noteworthy papers include: AutoSAGE, which presents an input-aware CUDA scheduler for sparse GNN aggregation; CAMformer, which proposes a novel accelerator that reinterprets attention as an associative memory operation; and IntAttention, which introduces a fully integer attention pipeline for efficient edge inference.
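To make the last idea concrete, here is a minimal NumPy sketch of attention with quantized, integer matrix multiplies, in the spirit of an integer attention pipeline. It is illustrative only: the function names and the symmetric per-tensor quantization scheme are our own assumptions, and the softmax is evaluated in floating point for clarity, whereas a fully integer pipeline such as IntAttention's would replace it with a fixed-point approximation.

```python
import numpy as np

def quantize(x, n_bits=8):
    """Symmetric per-tensor quantization: float array -> integer values plus a scale."""
    scale = np.abs(x).max() / (2 ** (n_bits - 1) - 1)
    return np.round(x / scale).astype(np.int32), scale

def quantized_attention(Q, K, V, n_bits=8):
    """Hypothetical sketch of attention with integer matmuls.
    Q @ K^T accumulates in int32; multiplying by the product of the
    two scales dequantizes the scores exactly."""
    d = Q.shape[-1]
    Qq, sq = quantize(Q, n_bits)
    Kq, sk = quantize(K, n_bits)
    scores = (Qq @ Kq.T) * (sq * sk) / np.sqrt(d)

    # Softmax in float for clarity; a fully integer pipeline would use a
    # fixed-point exp approximation and integer normalization instead.
    scores -= scores.max(axis=-1, keepdims=True)
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)

    # Requantize the attention weights so the second matmul is integer too.
    Pq, sp = quantize(probs, n_bits)
    Vq, sv = quantize(V, n_bits)
    return (Pq @ Vq) * (sp * sv)

# Tiny usage example with random data.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 16)) for _ in range(3))
print(quantized_attention(Q, K, V).shape)  # (4, 16)
```

The design point this sketch highlights is that both matrix multiplies, which dominate attention cost, can run entirely in integer arithmetic; only the normalization step needs special treatment, which is exactly where integer-attention work concentrates its effort.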