Advances in Scalable and Efficient Computing Systems

The field of computing systems is moving toward more scalable and efficient architectures, with a focus on improving performance and reducing latency. Recent work highlights the importance of optimizing memory management, metadata handling, and data visibility in distributed storage systems. Innovations in middleware design, such as adaptive load balancing and cooperative caching, show promise for mitigating metadata hotspots and improving system throughput, while advances in quantization techniques and platform-level optimization strategies enable more efficient inference of large language models on heterogeneous platforms. Noteworthy papers include MIDAS, which reduces average queue lengths by 23% and mitigates worst-case hotspots by up to 80%, and Beluga, which achieves an 89.6% reduction in Time-To-First-Token and a 7.35x throughput improvement for LLM inference. Other notable works include Kitty, which improves memory efficiency through 2-bit KV cache quantization, and Opt4GPTQ, which co-optimizes memory and computation for 4-bit quantized LLM inference on heterogeneous platforms.
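
As a rough illustration of the KV-cache quantization idea mentioned above, the sketch below applies a simple per-channel 2-bit min-max quantizer to a KV block in Python/NumPy. The function names and the min-max scheme are illustrative assumptions; they do not reproduce Kitty's dynamic channel-wise precision boost or Opt4GPTQ's GPTQ pipeline.

```python
import numpy as np

def quantize_kv_2bit(kv: np.ndarray):
    """Per-channel asymmetric 2-bit quantization of a KV-cache block.

    `kv` has shape (tokens, channels); each channel gets its own scale
    and zero point so high-magnitude channels are not clipped as hard.
    This is a plain min-max scheme for illustration only.
    """
    lo = kv.min(axis=0, keepdims=True)           # per-channel minimum
    hi = kv.max(axis=0, keepdims=True)           # per-channel maximum
    scale = (hi - lo) / 3.0                      # 2 bits -> 4 levels (0..3)
    scale = np.where(scale == 0.0, 1.0, scale)   # guard constant channels
    codes = np.clip(np.round((kv - lo) / scale), 0, 3).astype(np.uint8)
    return codes, scale, lo

def dequantize_kv_2bit(codes, scale, lo):
    """Reconstruct an approximate KV-cache block from the 2-bit codes."""
    return codes.astype(np.float32) * scale + lo

# Usage: quantize a synthetic KV block and check the reconstruction error.
kv = np.random.randn(128, 64).astype(np.float32)   # (tokens, channels)
codes, scale, lo = quantize_kv_2bit(kv)
kv_hat = dequantize_kv_2bit(codes, scale, lo)
print("mean abs error:", float(np.abs(kv - kv_hat).mean()))
```

Per-channel scales are the key design choice in this sketch: channels with a large dynamic range get their own scale, which is the kind of outlier handling that motivates channel-wise precision boosting in low-bit KV-cache schemes.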

Sources

MIDAS: Adaptive Proxy Middleware for Mitigating Metadata Hotspots in HPC I/O at Scale

Kitty: Accurate and Efficient 2-bit KV Cache Quantization with Dynamic Channel-wise Precision Boost

IOMMU Support for Virtual-Address Remote DMA in an ARMv8 environment

Opt4GPTQ: Co-Optimizing Memory and Computation for 4-bit GPTQ Quantized LLM Inference on Heterogeneous Platforms

SwitchDelta: Asynchronous Metadata Updating for Distributed Storage with In-Network Data Visibility

Beluga: A CXL-Based Memory Architecture for Scalable and Efficient LLM KVCache Management

Handling of Memory Page Faults during Virtual-Address RDMA
