Accelerating Large Language Model Inference and Training

The field of large language models (LLMs) is advancing rapidly, with a strong focus on inference and training efficiency. Recent work centers on speculative decoding, parallelization, and adaptive strategies for accelerating LLMs, with researchers exploring ways to reduce communication latency, optimize computation, and improve resource utilization. Together, these innovations deliver significant speedups in both LLM inference and training.
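
Since speculative decoding recurs throughout this batch of papers, a minimal sketch of the standard draft-and-verify loop may help fix ideas. The snippet below uses toy categorical distributions in place of real draft and target models; all names (`draft_probs`, `target_probs`, `speculative_step`) are illustrative and do not come from any of the papers listed here.

```python
# Minimal sketch of the draft-and-verify loop behind speculative decoding,
# using toy categorical distributions instead of real draft/target models.
import random

VOCAB = list(range(8))          # toy vocabulary of 8 token ids
K = 4                           # number of tokens drafted per step

def draft_probs(context):
    # Hypothetical cheap draft model: a fixed, skewed distribution.
    return [0.30, 0.20, 0.15, 0.10, 0.10, 0.05, 0.05, 0.05]

def target_probs(context):
    # Hypothetical expensive target model: a different distribution.
    return [0.25, 0.25, 0.10, 0.10, 0.10, 0.10, 0.05, 0.05]

def sample(probs):
    return random.choices(VOCAB, weights=probs, k=1)[0]

def speculative_step(context):
    """Draft K tokens cheaply, then accept/reject them against the target.

    Each drafted token x is accepted with probability min(1, p(x) / q(x)),
    where q is the draft distribution and p the target distribution; at the
    first rejection we resample from the residual max(p - q, 0) distribution,
    which preserves the target model's output distribution exactly.
    """
    accepted = []
    for _ in range(K):
        q = draft_probs(context + accepted)
        p = target_probs(context + accepted)   # in practice: one batched pass
        x = sample(q)
        if random.random() < min(1.0, p[x] / q[x]):
            accepted.append(x)                  # token accepted, keep going
        else:
            residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
            total = sum(residual)
            accepted.append(sample([r / total for r in residual]))
            break                               # stop at the first rejection
    return accepted

print(speculative_step(context=[]))
```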

Some noteworthy papers in this area include:

Fast and Expressive Multi-Token Prediction with Probabilistic Circuits, which investigates the trade-off between expressiveness and latency in multi-token prediction.

FarSkip-Collective, which modifies the architecture of modern models so that computation can overlap with communication, while matching the accuracy of the original models.

Speculative Decoding in Decentralized LLM Inference, which presents a plug-and-play framework for decentralized inference that turns communication delay into useful computation (a toy sketch of this overlap pattern appears after this list).

ParaDySe, an adaptive parallel-strategy switching framework for dynamic sequence lengths, which selects the optimal strategy on the fly for each incoming input sequence.

Beat the long tail: Distribution-Aware Speculative Decoding for RL Training, which proposes a framework that accelerates RL rollouts without altering model outputs.

Seer: Online Context Learning for Fast Synchronous LLM Reinforcement Learning, which presents a system that addresses performance bottlenecks in synchronous RL systems.

Reasoning in Diffusion Large Language Models is Concentrated in Dynamic Confusion Zones, which analyzes trajectories with step-level metrics and proposes a lightweight step-selection strategy that dynamically reallocates gradient updates to high-leverage steps.

Global Resolution: Optimal Multi-Draft Speculative Sampling via Convex Minimization, which proves that the underlying optimal transport problem reduces to a convex optimization problem, yielding provably optimal multi-draft speculative sampling.

Efficient Chromosome Parallelization for Precision Medicine Genomic Workflows, which proposes adaptive, RAM-efficient parallelization of chromosome-level bioinformatics workflows.

Learning Tractable Distributions Of Language Model Continuations, which pairs a base language model with a fixed tractable surrogate model to compute exact continuation probabilities.

Fast LLM Post-training via Decoupled and Best-of-N Speculation, which speeds up rollout with speculative decoding by deploying a fast path for otherwise unparallelizable generation.

Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter, which accelerates reasoning RL training losslessly by integrating adaptive speculative decoding.
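
As noted in the decentralized-inference entry above, the key idea is to keep the draft model busy while a verification request is in flight. Below is a toy sketch of that overlap pattern only, not the paper's actual protocol; `local_draft` and `remote_verify` are hypothetical stand-ins, and the network round-trip is simulated with a sleep.

```python
# Toy sketch: overlap local drafting with a (simulated) remote verification
# round-trip, so communication latency is filled with useful computation.
import concurrent.futures
import random
import time

def local_draft(prefix, n):
    # Hypothetical cheap local draft model: propose n random token ids.
    return [random.randrange(1000) for _ in range(n)]

def remote_verify(prefix, proposal):
    # Simulated request to the node holding the target model; the latency
    # here is exactly the window we want to fill with drafting.
    time.sleep(0.05)                      # pretend 50 ms of network delay
    return proposal[: random.randint(0, len(proposal))]

def generate(prefix, rounds=5, k=4):
    tokens = list(prefix)
    proposal = local_draft(tokens, k)
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        for _ in range(rounds):
            # Send the current proposal for remote verification...
            inflight = pool.submit(remote_verify, tokens, proposal)
            # ...and keep drafting during the round-trip instead of idling:
            # speculatively draft the continuation that follows the proposal.
            backup = local_draft(tokens + proposal, k)
            accepted = inflight.result()
            tokens += accepted
            if len(accepted) == len(proposal):
                # Full acceptance: the backup drafted during the wait still
                # continues the verified prefix and becomes the next proposal.
                proposal = backup
            else:
                # Partial acceptance: the backup is off-prefix, draft anew.
                proposal = local_draft(tokens, k)
    return tokens

print(generate(prefix=[1, 2, 3]))
```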

Sources

Fast and Expressive Multi-Token Prediction with Probabilistic Circuits

FarSkip-Collective: Unhobbling Blocking Communication in Mixture of Experts Models

Speculative Decoding in Decentralized LLM Inference: Turning Communication Latency into Computation Throughput

ParaDySe: A Parallel-Strategy Switching Framework for Dynamic Sequence Lengths in Transformer

What happens when nanochat meets DiLoCo?

Beat the long tail: Distribution-Aware Speculative Decoding for RL Training

Seer: Online Context Learning for Fast Synchronous LLM Reinforcement Learning

Reasoning in Diffusion Large Language Models is Concentrated in Dynamic Confusion Zones

Global Resolution: Optimal Multi-Draft Speculative Sampling via Convex Minimization

Efficient Chromosome Parallelization for Precision Medicine Genomic Workflows

Learning Tractable Distributions Of Language Model Continuations

Fast LLM Post-training via Decoupled and Best-of-N Speculation

Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter
