Efficient Inference and Optimization for Large Language Models

The field of Large Language Models (LLMs) is moving toward more efficient inference and optimization techniques that address the challenges of long-context processing, memory consumption, and latency. Researchers are accelerating inference, improving data-access efficiency, and reducing memory footprint through approaches such as hotness-aware inference optimization, semantic-aware KV cache eviction, and digital in-ReRAM computation, with measurable gains in speed, memory efficiency, and accuracy (a toy sketch of the hotness-aware placement idea follows the list below). Noteworthy papers include:

HA-RAG, which achieves an average 2.10x and a maximum 10.49x speedup in Time-To-First-Token (TTFT) with negligible accuracy loss.

SABlock, which consistently outperforms state-of-the-art baselines under the same memory budgets and reduces peak memory usage by 46.28%.

DIRC-RAG, which achieves an on-chip non-volatile memory density of 5.18 Mb/mm² and a throughput of 131 TOPS.

PureKV, which achieves 5.0x KV cache compression and 3.16x prefill acceleration with negligible quality degradation.
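To make the hotness-aware idea concrete, the sketch below shows a toy placement policy in that spirit: chunks that are accessed frequently are promoted to high precision in fast memory, while cold chunks stay quantized in a slower tier. This is a minimal illustration under assumed names (Chunk, HotnessTracker, HOT_FRACTION are hypothetical), not the HA-RAG implementation.

```python
"""Toy hotness-aware mixed-precision placement (illustrative sketch only)."""
from collections import Counter
from dataclasses import dataclass, field

HOT_FRACTION = 0.2  # assumption: top 20% most-accessed chunks are treated as "hot"


@dataclass
class Chunk:
    chunk_id: int
    data: bytes
    precision: str = "int4"  # cold chunks stored in low precision
    tier: str = "cpu"        # cold chunks stay in slower/cheaper memory


@dataclass
class HotnessTracker:
    access_counts: Counter = field(default_factory=Counter)

    def record_access(self, chunk_id: int) -> None:
        """Count each retrieval hit so hotness reflects observed access patterns."""
        self.access_counts[chunk_id] += 1

    def place(self, chunks: list[Chunk]) -> None:
        """Promote the most frequently accessed chunks to high precision / fast memory."""
        ranked = sorted(chunks, key=lambda c: self.access_counts[c.chunk_id], reverse=True)
        n_hot = max(1, int(len(ranked) * HOT_FRACTION))
        for i, chunk in enumerate(ranked):
            if i < n_hot:
                chunk.precision, chunk.tier = "fp16", "gpu"
            else:
                chunk.precision, chunk.tier = "int4", "cpu"


# Usage: record accesses observed during retrieval, then re-place chunks periodically.
tracker = HotnessTracker()
chunks = [Chunk(chunk_id=i, data=b"") for i in range(10)]
for cid in [0, 0, 0, 3, 3, 7]:
    tracker.record_access(cid)
tracker.place(chunks)
print([(c.chunk_id, c.precision, c.tier) for c in chunks])
```

The same access-frequency signal could drive other policies (e.g., which KV cache blocks to keep under a memory budget); the actual criteria used by HA-RAG, SABlock, and PureKV are defined in the respective papers.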

Sources

HA-RAG: Hotness-Aware RAG Acceleration via Mixed Precision and Data Placement

SABlock: Semantic-Aware KV Cache Eviction with Adaptive Compression Block Size

DIRC-RAG: Accelerating Edge RAG with Robust High-Density and High-Loading-Bandwidth Digital In-ReRAM Computation

PureKV: Plug-and-Play KV Cache Optimization with Spatial-Temporal Sparse Attention for Vision-Language Large Models
