The field of large language models (LLMs) is advancing rapidly, with a strong focus on long-context modeling. Recent research has emphasized extending context length while containing the quadratic cost of attention and the memory demands of inference. Approaches such as exploiting local KV cache asymmetry, training cartridges via self-study, and applying mixed-precision quantization to the KV cache have shown promising reductions in memory usage and gains in inference efficiency.

Noteworthy papers in this area include Homogeneous Keys, Heterogeneous Values: Exploiting Local KV Cache Asymmetry for Long-Context LLMs, which proposes a training-free compression framework combining homogeneity-based key merging with lossless value compression, and Cartridges: Lightweight and general-purpose long context representations via self-study, which trains a smaller KV cache offline on each corpus. In addition, KVmix: Gradient-Based Layer Importance-Aware Mixed-Precision Quantization for KV Cache and DEAL: Disentangling Transformer Head Activations for LLM Steering demonstrate, respectively, the potential of mixed-precision KV cache quantization and of causal-attribution frameworks for improving LLM performance.
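To make the key-merging idea concrete, below is a minimal, illustrative Python/PyTorch sketch of how adjacent, near-duplicate keys in a single attention head's cache could be merged while every value is kept intact. This is a toy sketch of the general idea, not the algorithm from the Homogeneous Keys, Heterogeneous Values paper: the function names, the cosine-similarity threshold, and the uniform sharing of a merged key's attention weight across its cluster's values are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F


def merge_homogeneous_keys(keys, values, sim_threshold=0.95):
    """Toy sketch (not the paper's method): merge runs of adjacent, highly
    similar keys into one representative key (their mean), while keeping
    every value uncompressed.

    keys, values: [seq_len, head_dim] tensors for a single attention head.
    Returns merged_keys [n_clusters, head_dim] and a list mapping each
    cluster to the token indices (and thus values) it covers.
    """
    clusters = [[0]]
    for t in range(1, keys.shape[0]):
        # Cosine similarity between the current key and the previous one.
        sim = F.cosine_similarity(keys[t], keys[t - 1], dim=0)
        if sim > sim_threshold:
            clusters[-1].append(t)   # keys are "homogeneous": merge into cluster
        else:
            clusters.append([t])     # start a new cluster
    merged_keys = torch.stack([keys[idx].mean(dim=0) for idx in clusters])
    return merged_keys, clusters


def attention_with_merged_keys(query, merged_keys, values, clusters):
    """Attend over the merged keys, then spread each cluster's attention
    weight uniformly over that cluster's original, untouched values."""
    d = query.shape[-1]
    scores = merged_keys @ query / d ** 0.5   # [n_clusters]
    weights = torch.softmax(scores, dim=0)
    out = torch.zeros_like(query)
    for w, idx in zip(weights, clusters):
        out += w * values[idx].mean(dim=0)    # per-token values are preserved
    return out
```

In this sketch, only the key side of the cache shrinks (one vector per cluster), which mirrors the asymmetry the summary describes: keys tolerate merging, while values are retained without loss.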