Efficient Large Language Models

The field of large language models is moving toward more efficient and scalable architectures. Recent work reduces the memory footprint and computational cost of these models while preserving their performance, through techniques such as dynamic cross-layer key-value (KV) cache sharing, query-agnostic KV cache compression, and linearization of pretrained transformers. These innovations speed up inference, reduce latency, and improve the overall scalability of large language models. Noteworthy papers include Krul, a multi-turn inference system that restores conversation state efficiently via dynamic cross-layer KV sharing, and Lizard, a linearization framework that transforms pretrained transformer-based models into flexible, subquadratic architectures. Additionally, Compactor presents a parameter-free, query-agnostic KV cache compression strategy based on approximate leverage scores, and MIRAGE introduces a parameter remapping approach that optimizes KV cache usage in multi-tenant serving.
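To make the KV-compression idea concrete, below is a minimal sketch of query-agnostic cache pruning. It scores each cached token with a cheap importance proxy and keeps only the top-scoring fraction; the function name and the key-norm score are illustrative assumptions, not Compactor's actual method, which calibrates approximate leverage scores instead.

```python
import torch

def compress_kv_cache(keys, values, keep_ratio=0.5):
    """Keep the highest-scoring fraction of a KV cache (single attention head).

    keys, values: [num_tokens, head_dim] tensors.
    The key-norm score below is a stand-in for Compactor's calibrated
    approximate leverage scores.
    """
    num_keep = max(1, int(keys.shape[0] * keep_ratio))
    scores = keys.norm(dim=-1)                          # per-token importance proxy
    kept = scores.topk(num_keep).indices.sort().values  # keep tokens in original order
    return keys[kept], values[kept]

# Example: halve a 1024-token cache for a 64-dimensional head.
k, v = torch.randn(1024, 64), torch.randn(1024, 64)
k_small, v_small = compress_kv_cache(k, v, keep_ratio=0.5)
print(k_small.shape)  # torch.Size([512, 64])
```

Because the score depends only on the keys, the cache can be compressed once and reused for any future query, which is what makes this style of compression query-agnostic.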

Sources

Krul: Efficient State Restoration for Multi-turn Conversations with Dynamic Cross-layer KV Sharing

Compactor: Calibrated Query-Agnostic KV Cache Compression with Approximate Leverage Scores

AbbIE: Autoregressive Block-Based Iterative Encoder for Efficient Sequence Modeling

Lizard: An Efficient Linearization Framework for Large Language Models

A Random Matrix Theory Perspective on the Learning Dynamics of Multi-head Latent Attention

Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation

Fusing LLM Capabilities with Routing Data

KV-Latent: Dimensional-level KV Cache Reduction with Frequency-aware Rotary Positional Embedding

MIRAGE: KV Cache Optimization through Parameter Remapping for Multi-tenant LLM Serving

Arctic Inference with Shift Parallelism: Fast and Efficient Open Source Inference System for Enterprise AI

BlockBPE: Parallel BPE Tokenization

IAM: Efficient Inference through Attention Mapping between Different-scale LLMs

Mixture of Raytraced Experts

FLEXITOKENS: Flexible Tokenization for Evolving Language Models

Synergy: End-to-end Concept Model
