The field of large language models (LLMs) is advancing rapidly, with a focus on improving inference efficiency, reducing memory overhead, and enhancing model performance. Recent work proposes novel architectures such as Mixture-of-Channels and Homogeneous Expert Routing, which aim to reduce activation memory and improve knowledge transfer. Researchers have also explored adaptive test-time scaling, dynamic self-consistency, and fast all-reduce communication to mitigate bottlenecks in distributed inference.

Noteworthy papers include SLOFetch, which introduces a compressed, hierarchical instruction-prefetching design for cloud microservices; DuetServe, a unified LLM serving framework that achieves disaggregation-level isolation within a single GPU; and PuzzleMoE, which efficiently compresses large mixture-of-experts models via sparse expert merging and bit-packed inference.
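To make one of these ideas concrete, the sketch below illustrates the general principle behind dynamic self-consistency: rather than always drawing a fixed number of samples and majority-voting, sampling stops early once one answer clearly dominates. This is a minimal illustration of the generic technique, not an implementation from any of the papers above; the `sample_answer` callable and the threshold parameters are hypothetical placeholders.

```python
from collections import Counter
from typing import Callable

def dynamic_self_consistency(
    sample_answer: Callable[[str], str],  # hypothetical: returns one sampled answer per call
    prompt: str,
    max_samples: int = 16,
    min_samples: int = 3,
    agreement: float = 0.6,
) -> str:
    """Sample answers until one crosses an agreement threshold,
    instead of always drawing a fixed number of samples."""
    votes: Counter = Counter()
    for n in range(1, max_samples + 1):
        votes[sample_answer(prompt)] += 1
        answer, count = votes.most_common(1)[0]
        # Stop early once a single answer dominates the samples drawn so far.
        if n >= min_samples and count / n >= agreement:
            return answer
    # Budget exhausted: fall back to a plain majority vote.
    return votes.most_common(1)[0][0]
```

Compared with plain self-consistency, which always spends the full `max_samples` budget, the early exit saves decode compute on easy queries while reserving the full budget for hard ones.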