Efficient Deployment and Acceleration of Large Language Models

The field of large language models is moving towards more efficient deployment and acceleration on edge devices, with a focus on reducing computational load and energy consumption. Recent research has introduced novel architectures, such as BlockFFN and SLIM, which exploit activation sparsity and adaptive thresholding to achieve significant speedups and energy-efficiency improvements. In addition, new algorithms and data structures, such as Wavelet-Enhanced Random Spectral Attention (WERSA) and compressed sparse formats, have been proposed to optimize performance and reduce storage overhead. The development of heterogeneous accelerators, including hybrid systolic arrays and coarse-grained reconfigurable arrays, is further enabling the deployment of large language models on low-power edge devices.

Notable papers in this area include BlockFFN, which achieves over 80% token-level sparsity and 70% chunk-level sparsity, and SLIM, which exploits sparsity through adaptive thresholding to reach 13-18x throughput improvements over SSD-GPU systems. WERSA is also noteworthy: it reduces computational load by 73.4% while achieving the best accuracy across all reported tests. Overall, these advances are making large language models more practical and affordable for real-world applications.
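To make the activation-sparsity idea concrete, the sketch below (plain NumPy) shows a threshold-gated FFN forward pass in which hidden units whose activations fall below a cutoff are skipped entirely. It is a minimal illustration under assumed names only: the function `sparse_ffn_forward` and the fixed threshold `tau` are not taken from the cited papers; BlockFFN exploits chunk-level activation sparsity with its own routing scheme, and SLIM determines its thresholds adaptively rather than using a fixed cutoff.

```python
import numpy as np

def sparse_ffn_forward(x, W_in, W_out, tau=0.1):
    """Illustrative threshold-gated FFN forward pass for one token.

    Hidden units whose activation magnitude falls below `tau` are skipped,
    so only a small slice of W_out is read and multiplied. The names and
    the fixed threshold are assumptions for this sketch, not the method
    of BlockFFN or SLIM.
    """
    h = np.maximum(x @ W_in, 0.0)              # ReLU activations, shape (d_ff,)
    active = np.flatnonzero(np.abs(h) > tau)   # hidden units that survive the cutoff
    # Only the active rows of the down-projection are touched; at high
    # sparsity this is where the compute and memory-traffic savings come from.
    y = h[active] @ W_out[active, :]
    sparsity = 1.0 - active.size / h.size
    return y, sparsity

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_model, d_ff = 64, 256
    x = rng.standard_normal(d_model)
    W_in = rng.standard_normal((d_model, d_ff)) * 0.05
    W_out = rng.standard_normal((d_ff, d_model)) * 0.05
    y, s = sparse_ffn_forward(x, W_in, W_out, tau=0.1)
    print(f"output dim: {y.shape[0]}, token-level sparsity: {s:.1%}")
```

The indexed slice of W_out is the key: at 80%+ sparsity, only a small fraction of the down-projection rows are loaded and multiplied per token, which is the kind of reduction that translates into throughput and energy gains on memory-bound edge hardware.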
Sources
Scaling Attention to Very Long Sequences in Linear Time with Wavelet-Enhanced Random Spectral Attention (WERSA)
BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity