Efficient Large Language Model Inference

The field of large language models (LLMs) is moving toward more efficient and scalable inference. Researchers are reducing the memory and computational cost of LLMs to make them practical in enterprise settings, through novel routing frameworks, optimized model partitioning and device assignment, and new benchmarking tools and metrics. Noteworthy papers in this area include Apriel-Nemotron-15B-Thinker, which achieves state-of-the-art performance while maintaining a smaller memory footprint; Avengers-Pro, a test-time routing framework that ensembles LLMs of varying capacities and efficiencies; Cost-Aware Contrastive Routing, which enables fast, cost-sensitive expert selection; and X-MoE, a novel MoE training system designed to deliver scalable training performance for next-generation MoE architectures.
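The routing idea shared by several of these works can be framed as a cost–quality trade-off: given per-model estimates of expected answer quality and inference cost, pick the model that maximizes a scalarized score. A minimal sketch of that idea follows; the model names, quality/cost numbers, and the trade-off weight `lam` are all hypothetical placeholders, not values from any cited paper.

```python
# Minimal sketch of performance-efficiency routing: each candidate model
# has an estimated quality for the query and a per-call cost, and we pick
# the model maximizing quality - lam * cost, where lam controls how
# strongly cost is penalized. All numbers are illustrative placeholders.

def route(candidates, lam):
    """candidates: list of (name, est_quality, est_cost) tuples."""
    return max(candidates, key=lambda m: m[1] - lam * m[2])[0]

models = [
    ("small-7b",   0.62, 0.1),   # cheap, lower expected quality
    ("medium-15b", 0.74, 0.4),
    ("large-70b",  0.83, 1.0),   # expensive, highest expected quality
]

print(route(models, lam=0.05))  # quality-leaning weight: picks large-70b
print(route(models, lam=0.5))   # cost-leaning weight: picks small-7b
```

Sweeping `lam` traces out the quality-versus-cost frontier that routing papers typically report; real systems replace the fixed quality estimates with learned per-query predictors.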

Sources

Apriel-Nemotron-15B-Thinker

Inference performance evaluation for LLMs on edge devices with a novel benchmarking framework and metric

CSGO: Generalized Optimization for Cold Start in Wireless Collaborative Edge LLM Systems

Dynamic Quality-Latency Aware Routing for LLM Inference in Wireless Edge-Device Networks

Cost-Aware Contrastive Routing for LLMs

Energy-Efficient Wireless LLM Inference via Uncertainty and Importance-Aware Speculative Decoding

Beyond GPT-5: Making LLMs Cheaper and Better via Performance-Efficiency Optimized Routing

Maximum Score Routing For Mixture-of-Experts

Accelerating Edge Inference for Distributed MoE Models with Latency-Optimized Expert Placement

X-MoE: Enabling Scalable Training for Emerging Mixture-of-Experts Architectures on HPC Platforms

Batching-Aware Joint Model Onloading and Offloading for Hierarchical Multi-Task Inference

NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model

Efficient Cloud-Edge-Device Query Execution Based on Collaborative Scan Operator
