The field of large language models (LLMs) is moving toward more efficient and effective architectures, with a focus on improving decoding processes, reducing computational costs, and enhancing ranking and retrieval capabilities. Recent studies have explored decoder-only models, adaptive blockwise search strategies, and lightweight ranking frameworks that achieve state-of-the-art results while minimizing computational overhead. There is also growing interest in developing more flexible and extensible LLM serving systems that can accommodate increasingly complex applications. Noteworthy papers in this area include:

- Language Ranker: introduces a framework for reranking candidate responses using features extracted by the base model, achieving performance comparable to large-scale reward models with significantly reduced computational overhead.
- Do Stop Me Now: proposes a simple yet effective method for detecting boilerplate responses after only a single generation step, enabling early termination or redirection to a smaller model and yielding significant savings in computational cost (a minimal sketch of this idea follows the list).
- E2Rank: presents a unified framework in which a text embedding model performs both high-quality retrieval and listwise reranking, achieving strong effectiveness with remarkable efficiency.
- AutoDeco: enables truly end-to-end generation by learning to control its own decoding strategy, allowing the model to self-regulate its sampling within a single forward pass and achieving performance comparable to an oracle-tuned baseline.
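
To make the early-termination idea behind Do Stop Me Now concrete, the sketch below runs a single forward pass and checks whether the next-token distribution concentrates on tokens that begin common boilerplate phrases. This is only an illustrative approximation under assumed details, not the paper's actual detector: the model name, the boilerplate prefix list, the probability-mass threshold, and the helper `is_likely_boilerplate` are all placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model and phrase list -- not taken from the paper.
MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"
BOILERPLATE_PREFIXES = ["I'm sorry", "As an AI", "I cannot"]

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def is_likely_boilerplate(prompt: str, threshold: float = 0.5) -> bool:
    """Run one forward pass and check whether probability mass on the first
    generated token concentrates on tokens that start known boilerplate
    phrases. The threshold is an assumed, untuned value."""
    # First token id of each boilerplate prefix.
    boilerplate_first_ids = {
        tokenizer(p, add_special_tokens=False)["input_ids"][0]
        for p in BOILERPLATE_PREFIXES
    }
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)
    next_token_probs = torch.softmax(logits[0, -1], dim=-1)
    mass = sum(next_token_probs[i].item() for i in boilerplate_first_ids)
    return mass >= threshold
```

In a serving system, a check like this could act as a cheap pre-filter: if it fires after the single step, the request can be terminated early or routed to a smaller model; otherwise generation proceeds with the full model.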