Efficient Large Language Model Inference and Optimization

The field of large language models (LLMs) is advancing rapidly, with a focus on improving inference efficiency, reducing memory overhead, and enhancing model performance. Recent work proposes novel architectures such as Mixture-of-Channels, which exploits sparse FFNs to reduce activation memory during pre-training and inference, and Homogeneous Expert Routing, which uses a single router across node types in heterogeneous graph transformers. Researchers have also explored adaptive test-time scaling with budget estimation, dynamic self-consistency, and fast all-reduce communication to mitigate bottlenecks in multi-node distributed inference. Noteworthy papers include SLOFetch, which introduces compressed-hierarchical instruction prefetching for cloud microservices; DuetServe, a unified LLM serving framework that achieves disaggregation-level isolation of prefill and decode within a single GPU via adaptive multiplexing; and PuzzleMoE, which compresses large mixture-of-experts models through sparse expert merging and bit-packed inference.
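
To make the adaptive test-time scaling and dynamic self-consistency ideas above concrete, the sketch below shows the general pattern of early-stopping majority voting: answers are sampled one at a time, and sampling stops once the leading answer either reaches a vote-share threshold or can no longer be overtaken within the remaining budget. This is a minimal illustration of the pattern, not the method of any paper listed here; sample_answer, max_samples, and threshold are illustrative placeholders.

    import random
    from collections import Counter

    def sample_answer(prompt: str) -> str:
        """Placeholder for one LLM call; in practice this would decode a full
        reasoning trace and extract the final answer."""
        return random.choice(["42", "42", "41"])  # stand-in answer distribution

    def adaptive_self_consistency(prompt: str, max_samples: int = 16,
                                  threshold: float = 0.75) -> str:
        """Sample answers one at a time; stop early when the leading answer
        holds at least `threshold` of the votes, or when the runner-up can no
        longer catch up within the remaining sampling budget."""
        votes = Counter()
        for n in range(1, max_samples + 1):
            votes[sample_answer(prompt)] += 1
            leader, lead_count = votes.most_common(1)[0]
            runner_up = votes.most_common(2)[1][1] if len(votes) > 1 else 0
            remaining = max_samples - n
            if lead_count / n >= threshold or lead_count > runner_up + remaining:
                return leader  # early exit: saved (max_samples - n) generations
        return votes.most_common(1)[0][0]

    if __name__ == "__main__":
        print(adaptive_self_consistency("What is 6 * 7?"))

The budget-estimation variants in the sources refine this loop by predicting up front how many samples a query is likely to need, rather than relying only on the stopping rule.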

Sources

SLOFetch: Compressed-Hierarchical Instruction Prefetching for Cloud Microservices

DuetServe: Harmonizing Prefill and Decode for LLM Serving via Adaptive GPU Multiplexing

PuzzleMoE: Efficient Compression of Large Mixture-of-Experts Models via Sparse Expert Merging and Bit-packed inference

Synera: Synergistic LLM Serving across Device and Cloud at Scale

From Attention to Disaggregation: Tracing the Evolution of LLM Inference

Motif 2 12.7B technical report

One Router to Route Them All: Homogeneous Expert Routing for Heterogeneous Graph Transformers

Machine Learning-Guided Memory Optimization for DLRM Inference on Tiered Memory

TabPFN-2.5: Advancing the State of the Art in Tabular Foundation Models

Pricing Online LLM Services with Data-Calibrated Stackelberg Routing Game

Mixture-of-Channels: Exploiting Sparse FFNs for Efficient LLMs Pre-Training and Inference

Seer Self-Consistency: Advance Budget Estimation for Adaptive Test-Time Scaling

LLM Inference Beyond a Single Node: From Bottlenecks to Mitigations with Fast All-Reduce Communication
