The field of large language models (LLMs) is moving toward more efficient decoding and retrieval. Recent work focuses on reducing latency and improving throughput in retrieval-augmented generation (RAG) and test-time scaling. Notable directions include decoding frameworks that exploit the sparsity structure of RAG contexts, dynamic speculative decoding methods that adapt to diverse serving workloads, guided decoding for RAG systems, and domain-aware retrieval. Together, these innovations aim to improve the performance and scalability of LLMs across applications.

Noteworthy papers: REFRAG proposes an efficient decoding framework for RAG applications; DSDE introduces a dynamic speculative decoding method with KLD stability for real-world serving; and Domain-Aware RAG: MoL-Enhanced RL for Efficient Training and Scalable Retrieval presents an approach to optimizing retrieval training and scaling in RAG systems.
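Since speculative decoding is central to several of the papers above, the sketch below illustrates the vanilla, greedy-verification form of the technique: a cheap draft model proposes a few tokens and the expensive target model accepts the longest agreeing prefix. This is a minimal illustration only; the function names, the toy models, and the per-token verification loop are assumptions for exposition and do not reflect the specific methods of REFRAG or DSDE (which, per the summary, adapts its speculation dynamically using KLD-based stability signals).

```python
"""Minimal sketch of vanilla (greedy-verification) speculative decoding.

Illustrative only: `draft_next` and `target_next` are hypothetical stand-ins
for a small draft model and a large target model.
"""
from typing import Callable, List


def speculative_decode(
    draft_next: Callable[[List[str]], str],   # cheap draft model: next-token guess
    target_next: Callable[[List[str]], str],  # expensive target model: next token
    prompt: List[str],
    k: int = 4,
    max_new_tokens: int = 16,
) -> List[str]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        # 1) Draft k candidate tokens cheaply with the small model.
        ctx = list(out)
        draft = []
        for _ in range(k):
            tok = draft_next(ctx)
            draft.append(tok)
            ctx.append(tok)

        # 2) Verify: accept the longest prefix the target model agrees with.
        #    (Real systems score all k positions in a single batched target
        #    forward pass; we call target_next per position for clarity.)
        accepted = 0
        for i, tok in enumerate(draft):
            if target_next(out + draft[:i]) == tok:
                accepted += 1
            else:
                break
        out.extend(draft[:accepted])

        # 3) On the first mismatch (or after full acceptance of k drafts),
        #    take one token from the target so decoding always advances.
        if accepted < k:
            out.append(target_next(out))

    return out[: len(prompt) + max_new_tokens]


if __name__ == "__main__":
    # Toy demo: the "target" deterministically continues the alphabet; the
    # "draft" usually agrees but is wrong at every fifth position.
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    target = lambda ctx: alphabet[len(ctx) % 26]
    draft = lambda ctx: alphabet[len(ctx) % 26] if len(ctx) % 5 else "?"
    print(speculative_decode(draft, target, list("ab"), k=4, max_new_tokens=10))
```

In practice, implementations verify drafts with one batched target pass and, when sampling rather than decoding greedily, use rejection sampling so the output distribution matches the target model exactly; dynamic variants additionally tune the draft length k per step based on how often drafts are being accepted.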