Efficient Inference and Generation in Large Language Models

The field of large language models (LLMs) is moving toward more efficient inference and generation. Recent work focuses on accelerating token-by-token generation, reducing latency, and improving energy efficiency, with notable advances in the integration of contextual information, speculative decoding, and dynamic hardware scheduling. These innovations enable more personalized and task-aware generation, powering use cases such as intelligent assistants and UI agents. In parallel, research on semantic selection, knowledge distillation, and decoding-free sampling strategies is improving both the efficiency and the accuracy of LLMs.

Noteworthy papers include:

Accelerating Mobile Language Model Generation via Hybrid Context and Hardware Coordination introduces a mobile inference framework that improves generation speed and energy efficiency through hybrid context and hardware coordination.

TokenTiming: A Dynamic Alignment Method for Universal Speculative Decoding Model Pairs proposes a dynamic alignment algorithm for universal speculative decoding that accommodates draft and target models with mismatched vocabularies; a generic sketch of the underlying draft-and-verify loop appears after this list.

GRATING: Low-Latency and Memory-Efficient Semantic Selection on Device develops a training-free inference system that reduces latency and peak memory usage for on-device semantic selection.

AdaSPEC: Selective Knowledge Distillation for Efficient Speculative Decoders uses selective knowledge distillation to improve the draft model's token acceptance rate without compromising generation quality.

Not-a-Bandit: Provably No-Regret Drafter Selection in Speculative Decoding for LLMs designs a drafter-selection algorithm that provably competes with the best draft model in hindsight; an illustrative selector in the same spirit is sketched below.

Decoding-Free Sampling Strategies for LLM Marginalization investigates sampling strategies that avoid token-by-token decoding while providing sufficiently accurate marginal estimates at a small fraction of the runtime cost.
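Several of these papers build on speculative decoding, where a small draft model proposes a short block of tokens and the large target model verifies them, accepting or rejecting each in turn. The following minimal sketch shows the standard accept/reject loop over next-token distributions, assuming (unlike TokenTiming) that draft and target share one vocabulary; `draft_dist` and `target_dist` are toy stand-ins for real model calls.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 8  # toy vocabulary size

def draft_dist(prefix):
    # Stand-in for the small draft model: any valid next-token distribution.
    logits = np.cos(np.arange(VOCAB) + len(prefix))
    p = np.exp(logits)
    return p / p.sum()

def target_dist(prefix):
    # Stand-in for the large target model.
    logits = np.sin(np.arange(VOCAB) + 0.5 * len(prefix))
    p = np.exp(logits)
    return p / p.sum()

def speculative_step(prefix, k=4):
    """Draft k tokens, then accept/reject them against the target model.

    Accepting token x with probability min(1, p_target(x) / p_draft(x)) and
    resampling a rejected token from the renormalized residual
    max(p_target - p_draft, 0) yields exactly the target model's
    distribution, so quality is preserved by construction.
    """
    drafted, q_dists = [], []
    ctx = list(prefix)
    for _ in range(k):
        q = draft_dist(ctx)
        x = rng.choice(VOCAB, p=q)
        drafted.append(x)
        q_dists.append(q)
        ctx.append(x)

    accepted = list(prefix)
    for x, q in zip(drafted, q_dists):
        p = target_dist(accepted)
        if rng.random() < min(1.0, p[x] / q[x]):
            accepted.append(x)          # draft token accepted
        else:
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()  # renormalized residual distribution
            accepted.append(rng.choice(VOCAB, p=residual))
            break                       # stop at the first rejection
    else:
        # All k drafts accepted: take one bonus token from the target model.
        accepted.append(rng.choice(VOCAB, p=target_dist(accepted)))
    return accepted

print(speculative_step(prefix=[1, 2, 3]))
```

Because this accept/reject rule preserves the target model's sampling distribution exactly, work such as AdaSPEC can focus on making the drafter more often accepted, and TokenTiming on aligning the drafted tokens across vocabularies, without risking generation quality.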
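For drafter selection, the "Not-a-Bandit" framing suggests full-information feedback: after each verification round, every candidate drafter can be scored, not only the one that was used. Under that assumption, a classical exponential-weights (Hedge) update is the textbook way to compete with the best fixed drafter in hindsight. The sketch below illustrates that general idea only and is not the paper's algorithm; the drafter names and the `acceptance_rate` function are hypothetical stand-ins for measured per-round performance.

```python
import math
import random

random.seed(0)

DRAFTERS = ["tiny-draft", "small-draft", "distilled-draft"]  # hypothetical names
ETA = 0.5  # learning rate for the exponential-weights update

def acceptance_rate(drafter):
    # Hypothetical stand-in: fraction of drafted tokens the target accepted
    # this round. In practice this would be measured during verification.
    base = {"tiny-draft": 0.55, "small-draft": 0.7, "distilled-draft": 0.8}
    return min(1.0, max(0.0, base[drafter] + random.gauss(0, 0.05)))

weights = {d: 1.0 for d in DRAFTERS}
usage = {d: 0 for d in DRAFTERS}
for step in range(200):
    total = sum(weights.values())
    probs = [weights[d] / total for d in DRAFTERS]
    # Sample a drafter to run this round's speculation.
    chosen = random.choices(DRAFTERS, weights=probs)[0]
    usage[chosen] += 1
    # Full information: score every drafter, then apply the Hedge update,
    # whose regret against the best fixed drafter grows only as sqrt(T log K).
    for d in DRAFTERS:
        loss = 1.0 - acceptance_rate(d)
        weights[d] *= math.exp(-ETA * loss)

print("times each drafter was chosen:", usage)
print("highest-weight drafter:", max(weights, key=weights.get))
```

The selection probabilities concentrate on the drafter with the highest acceptance rate over time, which is the sense in which such a scheme "competes with the best draft model in hindsight."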

Sources

Accelerating Mobile Language Model Generation via Hybrid Context and Hardware Coordination

TokenTiming: A Dynamic Alignment Method for Universal Speculative Decoding Model Pairs

GRATING: Low-Latency and Memory-Efficient Semantic Selection on Device

The Free Transformer

Fast Inference via Hierarchical Speculative Decoding

AdaSPEC: Selective Knowledge Distillation for Efficient Speculative Decoders

Not-a-Bandit: Provably No-Regret Drafter Selection in Speculative Decoding for LLMs

Decoding-Free Sampling Strategies for LLM Marginalization
