Advances in Long-Context Modeling for Large Language Models

Research on large language models (LLMs) is advancing rapidly, with a strong focus on long-context modeling: extending usable context length while containing the quadratic cost of attention and the memory footprint of the KV cache. Recent approaches, such as exploiting local KV cache asymmetry, training compact "cartridge" representations of a corpus via self-study, and applying mixed-precision quantization to the KV cache, have shown promising reductions in memory usage alongside gains in inference efficiency.

Noteworthy papers include Homogeneous Keys, Heterogeneous Values: Exploiting Local KV Cache Asymmetry for Long-Context LLMs, which proposes a training-free compression framework combining homogeneity-based key merging with lossless value compression, and Cartridges: Lightweight and general-purpose long context representations via self-study, which trains a smaller KV cache offline on each corpus. In addition, KVmix: Gradient-Based Layer Importance-Aware Mixed-Precision Quantization for KV Cache and DEAL: Disentangling Transformer Head Activations for LLM Steering demonstrate the potential of, respectively, mixed-precision KV cache quantization and causal-attribution analysis of attention heads for improving LLM performance. Minimal sketches illustrating the first three of these ideas follow below.
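To make the KV cache asymmetry idea concrete: the observation behind Homogeneous Keys, Heterogeneous Values is that neighbouring keys tend to be highly similar (homogeneous) while their values are not (heterogeneous), so keys can be merged where values must be preserved. The sketch below is a toy illustration of that intuition only, not the paper's framework; the greedy grouping scheme, the cosine-similarity threshold, and all function names are assumptions chosen for demonstration.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def merge_local_keys(keys, values, sim_threshold=0.95):
    """Greedily merge consecutive, highly similar keys; keep every value.

    keys, values: (seq_len, head_dim) arrays for a single attention head.
    Returns one averaged key per group plus that group's untouched values,
    so the key cache shrinks while values stay lossless.
    """
    groups = []  # each entry: ([key rows], [value rows])
    for k, v in zip(keys, values):
        if groups and cosine(groups[-1][0][-1], k) >= sim_threshold:
            groups[-1][0].append(k)   # key is "homogeneous": merge it
            groups[-1][1].append(v)   # value kept verbatim
        else:
            groups.append(([k], [v]))
    merged_keys = np.stack([np.mean(ks, axis=0) for ks, _ in groups])
    grouped_values = [np.stack(vs) for _, vs in groups]
    return merged_keys, grouped_values

# Toy usage: near-duplicate neighbouring keys collapse into one slot each.
rng = np.random.default_rng(0)
base = rng.normal(size=(4, 16))
keys = np.repeat(base, 3, axis=0) + 0.01 * rng.normal(size=(12, 16))
values = rng.normal(size=(12, 16))
mk, gv = merge_local_keys(keys, values)
print(keys.shape[0], "keys ->", mk.shape[0], "merged keys")
```

At inference time a merged key would score once per group, with its softmax weight shared across the group's untouched values.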
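The cartridge idea can likewise be caricatured in a few lines: train a small set of KV slots offline so that attention against them approximates attention against the full corpus cache. In the paper, self-study generates synthetic exchanges about the corpus; the sketch below substitutes random probe queries and a simple output-matching loss, so it illustrates only the general distillation shape of the method, with all sizes and names invented.

```python
import torch

torch.manual_seed(0)
d = 32                             # head dimension (toy size)
corpus_k = torch.randn(256, d)     # stand-in for keys of a long corpus
corpus_v = torch.randn(256, d)     # stand-in for values of a long corpus

# Cartridge: 16 trainable KV slots meant to replace the 256 corpus slots.
cart_k = torch.randn(16, d, requires_grad=True)
cart_v = torch.randn(16, d, requires_grad=True)

def attend(q, k, v):
    """Single-head scaled dot-product attention."""
    w = torch.softmax(q @ k.T / d ** 0.5, dim=-1)
    return w @ v

opt = torch.optim.Adam([cart_k, cart_v], lr=1e-2)
for step in range(500):
    q = torch.randn(64, d)                     # stand-in probe queries
    target = attend(q, corpus_k, corpus_v)     # behaviour with full cache
    pred = attend(q, cart_k, cart_v)           # behaviour with cartridge
    loss = (pred - target).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
print(f"final distillation loss: {loss.item():.4f}")
```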
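Finally, the general pattern behind layer-importance-aware mixed-precision quantization, as the KVmix title suggests, is to spend more bits on the layers whose KV cache matters most. The sketch below assumes the per-layer importance scores (gradient-based in KVmix) are already computed; the bit widths, high-precision ratio, and allocation rule are hypothetical, not KVmix's actual policy.

```python
import numpy as np

def quantize_dequantize(x, n_bits):
    """Symmetric per-tensor fake quantization: round to n_bits, map back."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(x).max() / qmax + 1e-12
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale  # dequantized view; storage would keep q and scale

def mixed_precision_kv(kv_per_layer, importance, high_bits=8, low_bits=4,
                       high_ratio=0.25):
    """Give the most important layers high precision, the rest low.

    kv_per_layer: list of per-layer KV tensors.
    importance:   one score per layer (e.g. a gradient-based sensitivity).
    """
    n_high = max(1, round(len(kv_per_layer) * high_ratio))
    high_set = set(np.argsort(importance)[::-1][:n_high])
    bits = [high_bits if i in high_set else low_bits
            for i in range(len(kv_per_layer))]
    deq = [quantize_dequantize(kv, b) for kv, b in zip(kv_per_layer, bits)]
    return deq, bits

# Toy usage: eight layers, random stand-in importance scores.
layers = [np.random.randn(2, 64, 64) for _ in range(8)]
scores = np.random.rand(8)
deq_layers, bits = mixed_precision_kv(layers, scores)
print("bits per layer:", bits)
```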

Sources

Homogeneous Keys, Heterogeneous Values: Exploiting Local KV Cache Asymmetry for Long-Context LLMs

Cartridges: Lightweight and general-purpose long context representations via self-study

KVmix: Gradient-Based Layer Importance-Aware Mixed-Precision Quantization for KV Cache

DEAL: Disentangling Transformer Head Activations for LLM Steering

Draft-based Approximate Inference for LLMs

Efficient Context Selection for Long-Context QA: No Tuning, No Iteration, Just Adaptive-$k$

On-the-Fly Adaptive Distillation of Transformer to Dual-State Linear Attention

Latent Multi-Head Attention for Small Language Models

Attention Head Embeddings with Trainable Deep Kernels for Hallucination Detection in LLMs

Query-Focused Retrieval Heads Improve Long-Context Reasoning and Re-ranking

Accelerating Diffusion Large Language Models with SlowFast: The Three Golden Principles

Precise Zero-Shot Pointwise Ranking with LLMs through Post-Aggregated Global Context Information
