Accelerating Large Language Model Inference

Research on large language models (LLMs) is increasingly focused on inference efficiency, particularly through speculative decoding and parallelization techniques. Recent work introduces methods such as adaptive layer parallelism, hybrid drafting, and rollback-aware branch parallelism to accelerate LLM inference. These approaches aim to reduce decoding latency and improve overall efficiency, making LLMs more practical for real-world deployment. Noteworthy papers include CLaSp, which proposes an in-context layer-skipping strategy for self-speculative decoding, and SpecBranch, which unlocks branch parallelism in speculative decoding. Other notable contributions are Consultant Decoding, which reports a 2.5-fold increase in inference speed, and AdaDecode, which accelerates LLM decoding without requiring auxiliary models or changes to the original model parameters.
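To make the draft-then-verify idea behind these papers concrete, the sketch below shows the generic speculative-decoding loop: a cheap draft model proposes a few tokens, and the target model verifies them with an accept/reject rule that preserves the target distribution. This is a minimal illustration with toy probability functions (`draft_probs`, `target_probs`, `GAMMA` are illustrative assumptions standing in for real models), not the method of any specific paper listed here.

```python
"""Minimal sketch of generic speculative decoding (draft-then-verify)."""
import numpy as np

VOCAB = 16   # toy vocabulary size (assumption for illustration)
GAMMA = 4    # number of draft tokens proposed per verification round
rng = np.random.default_rng(0)


def draft_probs(context: list[int]) -> np.ndarray:
    """Toy stand-in for a small draft model's next-token distribution."""
    logits = np.sin(np.arange(VOCAB) + len(context))
    e = np.exp(logits - logits.max())
    return e / e.sum()


def target_probs(context: list[int]) -> np.ndarray:
    """Toy stand-in for the large target model's next-token distribution."""
    logits = np.cos(np.arange(VOCAB) * 0.7 + len(context))
    e = np.exp(logits - logits.max())
    return e / e.sum()


def speculative_step(context: list[int]) -> list[int]:
    """One round: draft GAMMA tokens, then verify them against the target."""
    # 1) Draft phase: autoregressively propose GAMMA candidate tokens.
    proposed, q_dists = [], []
    ctx = list(context)
    for _ in range(GAMMA):
        q = draft_probs(ctx)
        tok = int(rng.choice(VOCAB, p=q))
        proposed.append(tok)
        q_dists.append(q)
        ctx.append(tok)

    # 2) Verification phase: the target model scores each proposed position
    #    (in a real system this is a single batched forward pass).
    accepted = []
    ctx = list(context)
    for tok, q in zip(proposed, q_dists):
        p = target_probs(ctx)
        # Accept the draft token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)
            ctx.append(tok)
        else:
            # Rejected: resample from the residual distribution max(p - q, 0),
            # which keeps the overall output distributed as the target model.
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(VOCAB, p=residual)))
            return accepted

    # All drafts accepted: the target model yields one extra "bonus" token.
    accepted.append(int(rng.choice(VOCAB, p=target_probs(ctx))))
    return accepted


if __name__ == "__main__":
    context = [1, 2, 3]
    for _ in range(3):
        new_tokens = speculative_step(context)
        context.extend(new_tokens)
        print(f"accepted {len(new_tokens)} tokens -> context {context}")
```

The papers above vary the pieces of this loop: where the draft comes from (a separate model, skipped layers of the target, or hybrid sources), how many branches are drafted in parallel, and how rejections or rollbacks are handled.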

Sources

CLaSp: In-Context Layer Skip for Self-Speculative Decoding

Cross-Attention Speculative Decoding

Speculative Decoding via Hybrid Drafting and Rollback-Aware Branch Parallelism

Consultant Decoding: Yet Another Synergistic Mechanism

Reuse or Generate? Accelerating Code Editing via Edit-Oriented Speculative Decoding

POSS: Position Specialist Generates Better Draft for Speculative Decoding

AdaDecode: Accelerating LLM Decoding with Adaptive Layer Parallelism

Pre$^3$: Enabling Deterministic Pushdown Automata for Faster Structured LLM Generation
