Efficient Optimization of Large Language Models

The field of large language models is moving toward more efficient optimization techniques that address high resource demands and limited context windows. Researchers are exploring methods such as pruning, quantization, and token dropping to reduce the compute and memory costs of these models while preserving accuracy. A key direction is the development of novel frameworks and strategies that balance efficiency, accuracy, and scalability across tasks and hardware configurations. Noteworthy papers in this regard include:

- Systematic Evaluation of Optimization Techniques for Long-Context Language Models, which provides a comprehensive analysis of optimization methods for long-context language models.
- EdgeInfinite-Instruct, which introduces a Segmented Supervised Fine-Tuning strategy tailored to long-sequence tasks and optimizes the model for efficient deployment on edge NPUs.
- HiPrune, which proposes a training-free, model-agnostic token pruning framework that achieves state-of-the-art pruning performance (a generic sketch of this style of pruning follows the list).
- VLMQ, which develops an importance-aware post-training quantization framework tailored to vision-language models.
- KVSink, which elucidates the mechanisms behind attention sinks during inference and introduces a plug-and-play method for preserving them during KV cache quantization (see the second sketch below).
- VFlowOpt, which proposes a token pruning framework built around an importance-map derivation process and a progressive pruning module with a recycling mechanism.
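
To make the token-pruning idea concrete, here is a minimal sketch of training-free, attention-based visual token pruning. It is a generic illustration under assumed inputs, not HiPrune's or VFlowOpt's actual algorithm: token importance is taken to be the attention mass each visual token receives, and the function name and `keep_ratio` parameter are hypothetical.

```python
# Generic sketch of training-free visual token pruning, NOT the
# algorithm from HiPrune or VFlowOpt. Importance here is simply the
# attention mass each visual token receives; `keep_ratio` is a
# hypothetical parameter chosen for illustration.
import numpy as np

def prune_visual_tokens(tokens: np.ndarray,
                        attn: np.ndarray,
                        keep_ratio: float = 0.25) -> np.ndarray:
    """Keep the most-attended visual tokens.

    tokens: (n_tokens, dim) visual token embeddings.
    attn:   (n_heads, n_queries, n_tokens) attention weights over the
            visual tokens, e.g. from one encoder layer.
    """
    # Score each token by the attention it receives, averaged
    # over heads and query positions.
    importance = attn.mean(axis=(0, 1))        # (n_tokens,)

    n_keep = max(1, int(len(tokens) * keep_ratio))
    keep = np.argsort(importance)[-n_keep:]    # indices of the top tokens
    keep.sort()                                # restore original order

    return tokens[keep]

# Example: prune 576 visual tokens down to 144 before the LLM sees them.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(576, 1024)).astype(np.float32)
attn = rng.random(size=(16, 32, 576)).astype(np.float32)
attn /= attn.sum(axis=-1, keepdims=True)       # rows sum to 1
print(prune_visual_tokens(tokens, attn).shape) # (144, 1024)
```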

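KVSink's actual mechanism is more involved, but the underlying idea of preserving attention sinks during KV cache quantization can be sketched as follows. This is an assumption-laden illustration: it naively treats the first few positions as sinks and keeps them in full precision while quantizing the remaining entries per token; the function names and the `n_sink` parameter are hypothetical.

```python
# Sketch of sink-aware KV cache quantization, NOT KVSink's method.
# Sink positions (naively, the first `n_sink` tokens) stay in fp32;
# the rest are quantized to int8 with per-token symmetric scales.
import numpy as np

def quantize_kv(kv: np.ndarray, n_sink: int = 4):
    """kv: (seq_len, dim) keys or values for one head.

    Returns (sink_fp, q, scales): full-precision sink entries, the
    int8 payload for the rest, and per-token dequantization scales.
    """
    sink_fp = kv[:n_sink].copy()              # sinks preserved in fp32
    rest = kv[n_sink:]

    scales = np.abs(rest).max(axis=1, keepdims=True) / 127.0
    scales = np.maximum(scales, 1e-8)         # avoid division by zero
    q = np.clip(np.round(rest / scales), -127, 127).astype(np.int8)
    return sink_fp, q, scales

def dequantize_kv(sink_fp, q, scales) -> np.ndarray:
    rest = q.astype(np.float32) * scales
    return np.concatenate([sink_fp, rest], axis=0)

# Example: quantize a 128-token cache, keeping 4 sink tokens exact.
rng = np.random.default_rng(0)
kv = rng.normal(size=(128, 64)).astype(np.float32)
sink_fp, q, scales = quantize_kv(kv, n_sink=4)
kv_hat = dequantize_kv(sink_fp, q, scales)
print(np.abs(kv - kv_hat).max())              # small reconstruction error
```
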
Sources

Systematic Evaluation of Optimization Techniques for Long-Context Language Models

EdgeInfinite-Instruct: Bridging SFT-Based Optimization and NPU-Level Efficiency for Edge Devices

HiPrune: Training-Free Visual Token Pruning via Hierarchical Attention in Vision-Language Models

VLMQ: Efficient Post-Training Quantization for Large Vision-Language Models via Hessian Augmentation

KVSink: Understanding and Enhancing the Preservation of Attention Sinks in KV Cache Quantization for LLMs

VFlowOpt: A Token Pruning Framework for LMMs with Visual Information Flow-Guided Optimization
