The field of large language models (LLMs) is moving toward more efficient and scalable inference, with researchers working to cut the memory and computational costs of LLMs so they are practical to deploy in enterprise settings. This work includes novel routing frameworks, optimized model partitioning and device assignment, and new benchmarking tools and metrics. Noteworthy papers in this area include Apriel-Nemotron-15B-Thinker, which achieves state-of-the-art performance while maintaining a smaller memory footprint, and Avengers-Pro, a test-time routing framework that ensembles LLMs of varying capacities and efficiencies. Other notable works include Cost-Spectrum Contrastive Routing, which enables fast, cost-sensitive selection of experts, and X-MoE, an MoE training system designed to deliver scalable training performance for next-generation MoE architectures.
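To make the routing idea concrete: systems like Avengers-Pro and Cost-Spectrum Contrastive Routing trade answer quality against inference cost when choosing which model handles a query. The details of those methods are not given here, so the sketch below is a generic, illustrative cost-aware router under assumed inputs: the model names, prices, and quality scores are hypothetical, and the quality estimates are assumed to come from some learned predictor.

```python
from dataclasses import dataclass

@dataclass
class Expert:
    name: str
    cost_per_1k_tokens: float  # hypothetical pricing, in dollars
    predicted_quality: float   # assumed output of a learned router, in [0, 1]

def route(experts: list[Expert], quality_threshold: float) -> Expert:
    """Pick the cheapest expert whose predicted quality clears the
    threshold; fall back to the highest-quality expert otherwise."""
    eligible = [e for e in experts if e.predicted_quality >= quality_threshold]
    if eligible:
        return min(eligible, key=lambda e: e.cost_per_1k_tokens)
    return max(experts, key=lambda e: e.predicted_quality)

# Hypothetical pool of models of varying capacity and cost.
experts = [
    Expert("small-7b", 0.10, 0.72),
    Expert("mid-15b", 0.35, 0.85),
    Expert("large-70b", 1.20, 0.93),
]

print(route(experts, 0.80).name)  # mid-15b: cheapest model above the 0.80 bar
```

Raising the threshold pushes traffic to larger, costlier models; lowering it saves cost on queries the small model can handle, which is the basic lever such routers expose.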