The field of large language models is moving toward more efficient and scalable architectures. Recent work focuses on reducing the memory footprint and computational cost of these models while preserving their performance, through techniques such as dynamic cross-layer knowledge sharing, query-agnostic key-value (KV) cache compression, and linearization frameworks. Together, these techniques speed up inference, lower latency, and improve the overall scalability of large language models. Noteworthy papers include Krul, which introduces a multi-turn inference system with dynamic compression strategies, and Lizard, a linearization framework that transforms pretrained transformer-based models into flexible, subquadratic architectures. In addition, Compactor presents a parameter-free, query-agnostic KV cache compression strategy, and MIRAGE introduces a parameter remapping approach to optimize KV cache usage.
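As a rough illustration of what query-agnostic KV cache compression means in practice (a minimal sketch, not the Compactor or MIRAGE algorithm), the snippet below scores cached key-value pairs without looking at any future query and keeps only the top-scoring fraction. The key-norm scoring rule, the function name, and the keep ratio are illustrative assumptions.

```python
# Illustrative sketch of query-agnostic KV cache compression (assumed heuristic,
# not a published method): cached key/value pairs are scored without reference
# to any future query -- here by key L2 norm -- and only the highest-scoring
# fraction is retained.
import numpy as np

def compress_kv_cache(keys: np.ndarray,
                      values: np.ndarray,
                      keep_ratio: float = 0.5):
    """Keep the top `keep_ratio` fraction of cached tokens.

    keys, values: arrays of shape (seq_len, head_dim).
    Returns compressed (keys, values) and the retained token indices.
    """
    seq_len = keys.shape[0]
    n_keep = max(1, int(seq_len * keep_ratio))

    # Query-agnostic importance score: it depends only on the cached keys,
    # so compression can run before any new query arrives.
    scores = np.linalg.norm(keys, axis=-1)

    # Retain the highest-scoring positions, preserving their original order
    # so the positional structure of the remaining cache is unchanged.
    kept = np.sort(np.argsort(scores)[-n_keep:])
    return keys[kept], values[kept], kept

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    k = rng.standard_normal((128, 64))   # 128 cached tokens, head_dim 64
    v = rng.standard_normal((128, 64))
    k_c, v_c, idx = compress_kv_cache(k, v, keep_ratio=0.25)
    print(k_c.shape, v_c.shape)          # (32, 64) (32, 64)
```

Because the scoring step never consults an incoming query, the cache can be shrunk once per turn rather than re-evaluated per request, which is the property that makes such strategies attractive for multi-turn inference systems.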